Abstract We provide a unifying framework linking two classes 
of statistics used in two-sample and independence testing: on the one 
hand, the energy distances and distance covariances from the statis- 
tics literature; on the other, Maximum Mean Discrepancies (MMD), 
i.e., distances between embeddings of distributions to reproducing 
kernel Hilbert spaces (RKHS), as established in machine learning. In 
the case where the energy distance is computed with a semimetric of 
negative type, a positive definite kernel, termed distance kernel, may 
be defined such that the MMD corresponds exactly to the energy 
distance. Conversely, for any positive definite kernel, we can inter- 
pret the MMD as energy distance with respect to some negative-type 
semimetric. This equivalence readily extends to distance covariance 
using kernels on the product space. We determine the class of proba- 
bility distributions for which the test statistics are consistent against 
all alternatives. Finally, we investigate the performance of the family 
of distance kernels in two-sample and independence tests: we show 
in particular that the energy distance most commonly employed in 
statistics is just one member of a parametric family of kernels, and 
that other choices from this family can yield more powerful tests. 

1. Introduction. The problem of testing statistical hypotheses in high 
dimensional spaces is particularly challenging, and has been a recent focus 
of considerable work in both the statistics and the machine learning com- 
munities. On the statistical side, two-sample testing in Euclidean spaces (of 
whether two independent samples are from the same distribution, or from 
different distributions) can be accomplished using a so-called energy dis- 
tance as a statistic [Szekely and Rizzo, 2004, 2005]. Such tests are consistent 
against all alternatives as long as the random variables have finite first
moments. A related dependence measure between vectors of high dimension is
the distance covariance [Szekely et al., 2007, Szekely and Rizzo, 2009], and
the resulting test is again consistent for variables with bounded first moment. 
The distance covariance has had a major impact in the statistics commu- 
nity, with [Szekely and Rizzo, 2009] being accompanied by an editorial in- 
troduction and discussion. A particular advantage of energy distance-based 
statistics is their compact representation in terms of certain expectations of 
pairwise Euclidean distances, which leads to straightforward empirical esti- 
mates. As a follow-up work, Lyons [2011] generalized the notion of distance 
covariance to metric spaces of negative type (of which Euclidean spaces are 
a special case).

On the machine learning side, two-sample tests have been formulated 
based on embeddings of probability distributions into reproducing kernel 
Hilbert spaces [Gretton et al., 2012], using as the test statistic the differ- 
ence between these embeddings: this statistic is called the Maximum Mean 
Discrepancy (MMD). This distance measure was also applied to the prob- 
lem of testing for independence, with the associated test statistic being the 
Hilbert-Schmidt Independence Criterion (HSIC) [Gretton et al., 2005, 2008, 
Smola et al., 2007, Zhang et al., 2011]. Both tests are shown to be consistent 
against all alternatives when a characteristic RKHS is used [Fukumizu et al.,
2009, Sriperumbudur et al., 2010]. Such tests can further be generalized to 
structured and non-Euclidean domains, such as text strings, graphs, and 
groups [Fukumizu et al., 2009]. 

Despite their striking similarity, the link between energy distance-based 
tests and kernel-based tests has been an open question. In the discussion of 
[Szekely and Rizzo, 2009], Gretton et al. [2009b, p. 1289] first explored this 
link in the context of independence testing, and found that interpreting the 
distance-based independence statistic as a kernel statistic is not straight- 
forward, since Bochner's theorem does not apply to the choice of weight 
function used in the definition of the distance covariance (we briefly review
this argument in Section 4.3). Szekely and Rizzo [2009, Rejoinder, p. 1303] 
confirmed this conclusion, and commented that RKHS-based dependence 
measures do not seem to be formal extensions of the distance covariance 
because the weight function is not integrable. Our contribution resolves this 
question and shows that RKHS-based dependence measures are precisely 
the formal extensions of the distance covariance, where the problem of
non-integrability of weight functions is circumvented by using
translation-variant kernels, i.e., distance-induced kernels, a novel family
of kernels that we introduce in Section 2.4.

In the case of two-sample testing, we demonstrate that energy distances 
are in fact maximum mean discrepancies arising from the same family of
distance-induced kernels. A number of interesting consequences arise from
this insight: first, as the energy distance (and distance covariance) derives
from a particular choice of a kernel, we can consider analogous quantities
arising from other kernels, and yielding a more sensitive test. Second,
results from [Gretton et al., 2009a, Zhang et al., 2011] may be applied to 
get consistent two-sample and independence tests for the energy distance, 
without using bootstrap, which perform much better than the upper bound 
proposed by Szekely et al. [2007] as an alternative to the bootstrap. Third, 
in relation to Lyons [2011], we obtain a new family of characteristic kernels 
arising from semimetric spaces of negative type (where the triangle inequal- 
ity need not hold), which are quite unlike the characteristic kernels defined 
via Bochner's theorem [Sriperumbudur et al., 2010]. 

The structure of the paper is as follows: in Section 2, we provide the nec- 
essary definitions from RKHS theory, and the relation between RKHS and 
semimetrics of negative type. We also give a review of the principles behind 
the Maximum Mean Discrepancy (MMD) and the Hilbert-Schmidt Inde- 
pendence Criterion (HSIC), the RKHS-based statistics used for two-sample 
and independence testing, respectively. In Sections 3 and 4, we review the 
notions of energy distance and distance covariance in general (semi)metric 
spaces of negative type, and show them to be special cases of MMD and 
HSIC, respectively. We give conditions for these quantities to distinguish 
between probability measures in Section 5, thus obtaining a new family of 
characteristic kernels. Empirical estimates of these quantities and associ- 
ated two-sample and independence tests are described in Section 6. Finally, 
in Section 7, we investigate the performance of the test statistics on a vari- 
ety of testing problems, which demonstrate the strengths of the new kernel 
family. 

This paper extends the conference publication [Sejdinovic et al., 2012], 
and gives a detailed technical discussion and proofs which were omitted in 
that work. 

2. Kernels and (semi)metrics. In this section, we introduce concepts 
and notation required to understand reproducing kernel Hilbert spaces (Sec- 
tion 2.1), and distribution embeddings into RKHS. We then introduce semi- 
metrics of negative type (Section 2.3), and reveal their relation to RKHS 
kernels. 

2.1. RKHS and kernel embeddings. Unless stated otherwise, we will assume
that $\mathcal{Z}$ is any topological space. We will denote by
$\mathcal{M}(\mathcal{Z})$ the set of all finite signed Borel measures on
$\mathcal{Z}$, and by $\mathcal{M}_+^1(\mathcal{Z})$ the set of all Borel
probability measures on $\mathcal{Z}$.

Definition 1. (RKHS) Let $\mathcal{H}$ be a Hilbert space of real-valued
functions defined on $\mathcal{Z}$. A function
$k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ is called a reproducing
kernel of $\mathcal{H}$ if:

1. $\forall z \in \mathcal{Z}$, $k(\cdot, z) \in \mathcal{H}$, and

2. $\forall z \in \mathcal{Z}$, $\forall f \in \mathcal{H}$, $\langle f, k(\cdot, z) \rangle_{\mathcal{H}} = f(z)$.

If $\mathcal{H}$ has a reproducing kernel, it is said to be a reproducing
kernel Hilbert space (RKHS).

According to the Moore-Aronszajn theorem [Berlinet and Thomas-Agnan,
2004, p. 19], for every symmetric, positive definite function (henceforth
kernel) $k : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$, there is an
associated RKHS $\mathcal{H}_k$ of real-valued functions on $\mathcal{Z}$
with reproducing kernel $k$. The map $\varphi : \mathcal{Z} \to \mathcal{H}_k$,
$\varphi : z \mapsto k(\cdot, z)$ is called the canonical feature map or the
Aronszajn map of $k$. We will say that $k$ is a nondegenerate kernel if its
Aronszajn map is injective. The notion of feature map can be extended to
kernel embeddings of finite signed Borel measures on $\mathcal{Z}$
[Smola et al., 2007, Sriperumbudur et al., 2010], [Berlinet and
Thomas-Agnan, 2004, Chapter 4].

Definition 2. (Kernel embedding) Let $k$ be a kernel on $\mathcal{Z}$, and
$\nu \in \mathcal{M}(\mathcal{Z})$. The kernel embedding of $\nu$ into the
RKHS $\mathcal{H}_k$ is $\mu_k(\nu) \in \mathcal{H}_k$ such that
$\int f(z)\, d\nu(z) = \langle f, \mu_k(\nu) \rangle_{\mathcal{H}_k}$ for all
$f \in \mathcal{H}_k$.

Alternatively, the kernel embedding can be defined by the Bochner integral
$\mu_k(\nu) = \int k(\cdot, z)\, d\nu(z)$. If a measurable kernel $k$ is a
bounded function, $\mu_k(\nu)$ exists for all $\nu \in \mathcal{M}(\mathcal{Z})$.
On the other hand, if $k$ is not bounded, there will always exist
$\nu \in \mathcal{M}(\mathcal{Z})$ for which $\int k(\cdot, z)\, d\nu(z)$
diverges. The kernels we will consider in this paper will be continuous, and
hence measurable, but unbounded, so kernel embeddings will not be defined for
some finite signed measures. Thus, we need to restrict attention to a
particular class of measures for which kernel embeddings exist (this will
later be shown to reflect the condition that random variables considered in
distance covariance tests must have finite moments). Let $k$ be a measurable
kernel on $\mathcal{Z}$, and denote, for $\theta > 0$,

(2.1) $\mathcal{M}_k^{\theta}(\mathcal{Z}) = \left\{ \nu \in \mathcal{M}(\mathcal{Z}) : \int k^{\theta}(z, z)\, d|\nu|(z) < \infty \right\}.$

Clearly,

(2.2) $\theta_1 \le \theta_2 \implies \mathcal{M}_k^{\theta_2}(\mathcal{Z}) \subseteq \mathcal{M}_k^{\theta_1}(\mathcal{Z}).$

Note that the kernel embedding $\mu_k(\nu)$ is well defined
$\forall \nu \in \mathcal{M}_k^{1/2}(\mathcal{Z})$, by the Riesz representation
theorem. Indeed, $T_\nu : f \mapsto \int f(z)\, d\nu(z)$ is then a bounded
linear functional on $\mathcal{H}_k$, which can be seen using the reproducing
property and the Cauchy-Schwarz inequality,

$T_\nu f = \int f(z)\, d\nu(z) = \int \langle f, k(\cdot, z) \rangle_{\mathcal{H}_k}\, d\nu(z) \le \|f\|_{\mathcal{H}_k} \int k^{1/2}(z, z)\, d|\nu|(z).$

Thus, kernel embeddings of Borel probability measures in the space
$\mathcal{M}_+^1(\mathcal{Z}) \cap \mathcal{M}_k^{1/2}(\mathcal{Z})$ do exist,
and we introduce the notion of distance between Borel probability measures
in this space using the Hilbert space distance between their embeddings.

Definition 3. (Maximum mean discrepancy) Let $k$ be a kernel on
$\mathcal{Z}$, and let $P, Q \in \mathcal{M}_+^1(\mathcal{Z}) \cap \mathcal{M}_k^{1/2}(\mathcal{Z})$.
The maximum mean discrepancy (MMD) $\gamma_k$ between $P$ and $Q$ is given by
[Gretton et al., 2012, Lemma 4],

$\gamma_k(P, Q) = \| \mu_k(P) - \mu_k(Q) \|_{\mathcal{H}_k}.$

The following alternative representation of the squared MMD [from
Gretton et al., 2012, Lemma 6] will be useful:

$\gamma_k^2(P, Q) = \mathbb{E}_{ZZ'} k(Z, Z') + \mathbb{E}_{WW'} k(W, W') - 2\, \mathbb{E}_{ZW} k(Z, W)$

(2.3) $= \int\!\!\int k \, d([P - Q] \times [P - Q]),$

where $Z, Z' \overset{i.i.d.}{\sim} P$ and $W, W' \overset{i.i.d.}{\sim} Q$.
If the restriction of $\mu_k$ to some $\mathcal{P}(\mathcal{Z}) \subseteq \mathcal{M}_+^1(\mathcal{Z})$
is well defined and injective, then $k$ is said to be characteristic to
$\mathcal{P}(\mathcal{Z})$, and it is said to be characteristic (without
further qualification) if it is characteristic to $\mathcal{M}_+^1(\mathcal{Z})$.
When $k$ is characteristic, $\gamma_k$ is a metric on the entire
$\mathcal{M}_+^1(\mathcal{Z})$, i.e., $\gamma_k(P, Q) = 0$ iff $P = Q$,
$\forall P, Q \in \mathcal{M}_+^1(\mathcal{Z})$. Conditions under which kernels
are characteristic have been studied in [Sriperumbudur et al., 2008,
Fukumizu et al., 2009, Sriperumbudur et al., 2010]. An alternative
interpretation of (2.3) is as an integral probability metric [Müller, 1997],

(2.4) $\gamma_k(P, Q) = \sup_{f \in \mathcal{H}_k,\ \|f\|_{\mathcal{H}_k} \le 1} \left[ \mathbb{E}_{Z \sim P} f(Z) - \mathbb{E}_{W \sim Q} f(W) \right].$

See [Gretton et al., 2012] for details. 
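
As an illustration of (2.3), the squared MMD between two samples can be estimated by replacing each of the three expectations with an average over the corresponding Gram matrix (a plug-in, or V-statistic, estimate). The following is a minimal sketch rather than part of the original development: the Gaussian kernel, its bandwidth `sigma`, and the function names `gaussian_gram` and `mmd2_vstat` are assumptions made for the example.

```python
import numpy as np


def gaussian_gram(a, b, sigma=1.0):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2)).

    The Gaussian kernel and the bandwidth sigma are illustrative choices.
    """
    sq = (np.sum(a ** 2, axis=1)[:, None]
          + np.sum(b ** 2, axis=1)[None, :]
          - 2.0 * a @ b.T)
    # Guard against tiny negative values caused by rounding.
    return np.exp(-np.maximum(sq, 0.0) / (2.0 * sigma ** 2))


def mmd2_vstat(z, w, kernel=gaussian_gram):
    """Plug-in estimate of the squared MMD, term by term as in (2.3).

    z and w are arrays of shape (n, d) and (m, d) holding the two samples.
    """
    return kernel(z, z).mean() + kernel(w, w).mean() - 2.0 * kernel(z, w).mean()
```

Any function returning a Gram matrix can be passed as `kernel`, so the same estimator applies to the distance kernels introduced in Section 2.4.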
2.2. Hilbert-Schmidt Independence Criterion (HSIC). The MMD can
be employed to measure statistical dependence between random variables
[Gretton et al., 2005, 2008, Smola et al., 2007, Zhang et al., 2011]. Let
$\mathcal{X}$ and $\mathcal{Y}$ be two non-empty topological spaces and let
$k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$ be kernels on $\mathcal{X}$ and
$\mathcal{Y}$, with respective RKHSs $\mathcal{H}_{k_{\mathcal{X}}}$ and
$\mathcal{H}_{k_{\mathcal{Y}}}$. Then, by applying [Steinwart and Christmann,
2008, Lemma 4.6, p. 114],

(2.5) $k((x, y), (x', y')) = k_{\mathcal{X}}(x, x')\, k_{\mathcal{Y}}(y, y')$

is a kernel on the product space $\mathcal{X} \times \mathcal{Y}$ with RKHS
$\mathcal{H}_k$ isometrically isomorphic to the tensor product
$\mathcal{H}_{k_{\mathcal{X}}} \otimes \mathcal{H}_{k_{\mathcal{Y}}}$.

Definition 4. Let $X \sim P_X$ and $Y \sim P_Y$ be random variables on
$\mathcal{X}$ and $\mathcal{Y}$, respectively, having joint distribution
$P_{XY}$. Furthermore, let $k$ be a kernel on $\mathcal{X} \times \mathcal{Y}$,
given in (2.5). The Hilbert-Schmidt Independence Criterion (HSIC) of $X$ and
$Y$ is the MMD $\gamma_k$ between the joint distribution $P_{XY}$ and the
product of its marginals $P_X P_Y$.

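Following [Smola et al., 2007, Section 2.3], we can expand HSIC as:

(2.6) $\gamma_k^2(P_{XY}, P_X P_Y)$

$= \| \mu_k(P_{XY}) - \mu_k(P_X P_Y) \|_{\mathcal{H}_k}^2$

$= \| \mathbb{E}_{XY} [ k_{\mathcal{X}}(\cdot, X) \otimes k_{\mathcal{Y}}(\cdot, Y) ] - \mathbb{E}_X k_{\mathcal{X}}(\cdot, X) \otimes \mathbb{E}_Y k_{\mathcal{Y}}(\cdot, Y) \|_{\mathcal{H}_{k_{\mathcal{X}}} \otimes \mathcal{H}_{k_{\mathcal{Y}}}}^2$

$= \mathbb{E}_{XY} \mathbb{E}_{X'Y'} k_{\mathcal{X}}(X, X')\, k_{\mathcal{Y}}(Y, Y') + \mathbb{E}_X \mathbb{E}_{X'} k_{\mathcal{X}}(X, X')\, \mathbb{E}_Y \mathbb{E}_{Y'} k_{\mathcal{Y}}(Y, Y')$

$\quad - 2\, \mathbb{E}_{X'Y'} [ \mathbb{E}_X k_{\mathcal{X}}(X, X')\, \mathbb{E}_Y k_{\mathcal{Y}}(Y, Y') ],$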

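where in the last step we used that $\langle f \otimes g, f' \otimes g' \rangle_{\mathcal{H}_{k_{\mathcal{X}}} \otimes \mathcal{H}_{k_{\mathcal{Y}}}} = \langle f, f' \rangle_{\mathcal{H}_{k_{\mathcal{X}}}} \langle g, g' \rangle_{\mathcal{H}_{k_{\mathcal{Y}}}}$.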

It can be shown that this quantity is equal to the squared Hilbert-Schmidt 
norm of the covariance operator between RKHSs [Gretton et al., 2005]. 
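
The last line of the expansion (2.6) suggests a direct plug-in estimate of HSIC from a paired sample: each expectation is replaced by an average over Gram matrices. The sketch below is illustrative only; the Gaussian kernel, its bandwidth `sigma`, and the names `gaussian_gram` and `hsic_vstat` are assumptions of the example, not notation from the text.

```python
import numpy as np


def gaussian_gram(a, sigma=1.0):
    """Gram matrix exp(-||a_i - a_j||^2 / (2 sigma^2)); an illustrative kernel choice."""
    sq = np.sum((a[:, None, :] - a[None, :, :]) ** 2, axis=2)
    return np.exp(-sq / (2.0 * sigma ** 2))


def hsic_vstat(gram_x, gram_y):
    """Plug-in estimate of HSIC following the three terms of (2.6).

    gram_x and gram_y are symmetric n x n Gram matrices computed on the
    paired sample (x_1, y_1), ..., (x_n, y_n).
    """
    joint = np.mean(gram_x * gram_y)                 # E_XY E_X'Y' k_X k_Y
    marginals = gram_x.mean() * gram_y.mean()        # E_X E_X' k_X  E_Y E_Y' k_Y
    cross = np.mean(gram_x.mean(axis=1) * gram_y.mean(axis=1))
    return joint + marginals - 2.0 * cross
```

For symmetric Gram matrices this coincides with the familiar centred form $n^{-2}\,\mathrm{tr}(KHLH)$, where $H = I - n^{-1}\mathbf{1}\mathbf{1}^\top$.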
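We claim that $\gamma_k^2(P_{XY}, P_X P_Y)$ is well defined as long as
$P_X \in \mathcal{M}_{k_{\mathcal{X}}}^1(\mathcal{X})$ and
$P_Y \in \mathcal{M}_{k_{\mathcal{Y}}}^1(\mathcal{Y})$. Indeed, the embedding
$\mu_k(P_X P_Y)$ of the product of marginals can be identified with the tensor
product $\mu_{k_{\mathcal{X}}}(P_X) \otimes \mu_{k_{\mathcal{Y}}}(P_Y)$, where
$\mu_{k_{\mathcal{X}}}(P_X)$ exists since
$P_X \in \mathcal{M}_{k_{\mathcal{X}}}^1(\mathcal{X}) \subseteq \mathcal{M}_{k_{\mathcal{X}}}^{1/2}(\mathcal{X})$,
and $\mu_{k_{\mathcal{Y}}}(P_Y)$ exists since
$P_Y \in \mathcal{M}_{k_{\mathcal{Y}}}^1(\mathcal{Y}) \subseteq \mathcal{M}_{k_{\mathcal{Y}}}^{1/2}(\mathcal{Y})$.
Furthermore, $\mu_k(P_{XY})$ exists since
$P_{XY} \in \mathcal{M}_k^{1/2}(\mathcal{X} \times \mathcal{Y})$, which can be
seen from the Cauchy-Schwarz inequality,

$\int k^{1/2}((x, y), (x, y))\, dP_{XY}(x, y) = \int k_{\mathcal{X}}^{1/2}(x, x)\, k_{\mathcal{Y}}^{1/2}(y, y)\, dP_{XY}(x, y)$

$\le \left( \int k_{\mathcal{X}}(x, x)\, dP_X(x) \int k_{\mathcal{Y}}(y, y)\, dP_Y(y) \right)^{1/2} < \infty.$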



2.3. Semimetrics of negative type. We will work with the notion of semi- 
metric of negative type on a non-empty set Z, where the "distance" func- 
tion need not satisfy the triangle inequality. Note that this notion of semi- 
metric is different to that which arises from the seminorm (also called the 
pseudonorm), where the distance between two distinct points can be zero. 

Definition 5. (Semimetric) Let $\mathcal{Z}$ be a non-empty set and let
$\rho : \mathcal{Z} \times \mathcal{Z} \to [0, \infty)$ be a function such
that $\forall z, z' \in \mathcal{Z}$,

1. $\rho(z, z') = 0$ if and only if $z = z'$, and

2. $\rho(z, z') = \rho(z', z)$.

Then $(\mathcal{Z}, \rho)$ is said to be a semimetric space and $\rho$ is
called a semimetric on $\mathcal{Z}$. If, in addition,

3. $\forall z, z', z'' \in \mathcal{Z}$, $\rho(z', z'') \le \rho(z, z') + \rho(z, z'')$,

$(\mathcal{Z}, \rho)$ is said to be a metric space and $\rho$ is called a
metric on $\mathcal{Z}$.

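Definition 6. (Negative type) The semimetric space $(\mathcal{Z}, \rho)$ is
said to have negative type if $\forall n \ge 2$, $z_1, \ldots, z_n \in \mathcal{Z}$,
and $\alpha_1, \ldots, \alpha_n \in \mathbb{R}$, with $\sum_{i=1}^n \alpha_i = 0$,

(2.7) $\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j \rho(z_i, z_j) \le 0.$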

Note that in the terminology of Berg et al. [1984], p satisfying (2.7) is said 
to be a negative definite function. The following proposition is derived from 
[Berg et al., 1984, Corollary 2.10, p. 78, and Proposition 3.2, p. 82]. 

Proposition 7.

1. If $\rho$ satisfies (2.7), then so does $\rho^q$, for $0 < q < 1$.

2. $\rho$ is a semimetric of negative type if and only if there exists a
Hilbert space $\mathcal{H}$ and an injective map
$\varphi : \mathcal{Z} \to \mathcal{H}$, such that

(2.8) $\rho(z, z') = \| \varphi(z) - \varphi(z') \|_{\mathcal{H}}^2.$

The second part of the Proposition shows that
$(\mathbb{R}^d, \| \cdot - \cdot \|^2)$ is of negative type, and by taking
$q = 1/2$ in the first part, we conclude that all Euclidean spaces are of
negative type. In addition, whenever $\rho$ is a semimetric of negative type,
$\rho^{1/2}$ is a metric of negative type, i.e., even though $\rho$ may not
satisfy the triangle inequality, its square root must do so if $\rho$ obeys (2.7).

2.4. Distance kernels. Lyons [2011, p. 9] also uses the results in Propo- 
sition 7 to study embeddings to general Hilbert spaces, however the rela- 
tion with the theory of reproducing kernel Hilbert spaces is not exploited. 
Semimetrics of negative type and symmetric positive definite kernels are in 
fact closely related, as summarized in the following Lemma, adapted from 
[Berg et al, 1984, Lemma 2.1, p. 74]. 

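Lemma 8. Let $\mathcal{Z}$ be a nonempty set, and
$\rho : \mathcal{Z} \times \mathcal{Z} \to \mathbb{R}$ a semimetric on
$\mathcal{Z}$. Let $z_0 \in \mathcal{Z}$, and denote
$k(z, z') = \rho(z, z_0) + \rho(z', z_0) - \rho(z, z')$. Then $k$ is positive
definite if and only if $\rho$ satisfies (2.7).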

As a consequence, k{z,z') defined above is a valid kernel on Z whenever 
p is a semimetric of negative type. For convenience, we will work with such 
kernels scaled by 1/2. 

Definition 9. (Distance-induced kernel) Let $\rho$ be a semimetric of
negative type on $\mathcal{Z}$ and let $z_0 \in \mathcal{Z}$. The kernel

(2.9) $k(z, z') = \frac{1}{2} \left[ \rho(z, z_0) + \rho(z', z_0) - \rho(z, z') \right]$

is said to be the distance-induced kernel induced by $\rho$ and centred at $z_0$.

For brevity, we will drop "induced" hereafter, and say that $k$ is simply
the distance kernel (with some abuse of terminology). Note that distance
kernels are not strictly positive definite, i.e., it is not true that
$\forall n \in \mathbb{N}$, and for distinct $z_1, \ldots, z_n \in \mathcal{Z}$,

$\sum_{i=1}^n \sum_{j=1}^n \alpha_i \alpha_j k(z_i, z_j) = 0 \implies \alpha_i = 0 \ \ \forall i.$

Indeed, if $k$ were given by (2.9), it would suffice to take $n = 1$, since
$k(z_0, z_0) = 0$. By varying the point at the center $z_0$, we obtain a family

$\mathcal{K}_\rho = \left\{ \frac{1}{2} \left[ \rho(z, z_0) + \rho(z', z_0) - \rho(z, z') \right] \right\}_{z_0 \in \mathcal{Z}}$

of distance kernels induced by $\rho$. We may now express (2.8) from
Proposition 7 in terms of the canonical feature map for the RKHS $\mathcal{H}_k$.

Proposition 10. Let $(\mathcal{Z}, \rho)$ be a semimetric space of negative
type, and $k \in \mathcal{K}_\rho$. Then:

1. $k$ is nondegenerate, i.e., the Aronszajn map $z \mapsto k(\cdot, z)$ is injective.

2. $\rho(z, z') = k(z, z) + k(z', z') - 2k(z, z') = \| k(\cdot, z) - k(\cdot, z') \|_{\mathcal{H}_k}^2$.

Proof. If $z, z' \in \mathcal{Z}$ are such that $k(w, z) = k(w, z')$ for all
$w \in \mathcal{Z}$, we would also have
$\rho(z, z_0) - \rho(z, w) = \rho(z', z_0) - \rho(z', w)$ for all
$w \in \mathcal{Z}$. In particular, by inserting $w = z$ and $w = z'$, we
obtain $\rho(z, z') = -\rho(z, z') = 0$, i.e., $z = z'$. The second statement
follows readily by expressing $k$ in terms of $\rho$. □

Example 11. Let $\mathcal{Z} \subseteq \mathbb{R}^d$ and write
$\rho_q(z, z') = \| z - z' \|^q$. By Proposition 7, $\rho_q$ is a valid
semimetric of negative type for $0 < q \le 2$. The corresponding kernel
centered at $z_0 = 0$ is given by:

(2.10) $k_q(z, z') = \frac{1}{2} \left( \|z\|^q + \|z'\|^q - \|z - z'\|^q \right).$
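
The construction in Definition 9 and Example 11 can be checked numerically. The sketch below builds the Gram matrix of the distance kernel induced by $\rho_q$ on a finite sample, so that one can verify that it is positive semidefinite (Lemma 8) and that it recovers $\rho_q$ through the identity in Proposition 10. The function names `rho_q` and `distance_kernel`, and the optional centre argument `z0`, are hypothetical names introduced for this illustration.

```python
import numpy as np


def rho_q(a, b, q=1.0):
    """Pairwise semimetric ||a_i - b_j||^q (of negative type for 0 < q <= 2)."""
    diff = a[:, None, :] - b[None, :, :]
    return np.linalg.norm(diff, axis=2) ** q


def distance_kernel(a, b, q=1.0, z0=None):
    """Distance kernel (2.9) induced by rho_q and centred at z0.

    With z0 = None the centre is the origin, which gives (2.10).
    """
    if z0 is None:
        z0 = np.zeros((1, a.shape[1]))
    else:
        z0 = np.asarray(z0, dtype=float).reshape(1, -1)
    # rho(z, z0) varies along rows, rho(z', z0) along columns.
    return 0.5 * (rho_q(a, z0, q) + rho_q(b, z0, q).T - rho_q(a, b, q))
```

For $q = 2$ and $z_0 = 0$ the kernel reduces to the linear kernel $\langle z, z' \rangle$, which gives a quick sanity check of the implementation.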

2.5. Semimetrics generated by kernels. We now further develop the link 
between semimetrics of negative type and kernels. We start with a simple 
corollary of Proposition 7: 

Corollary 12. Let k be any nondegenerate kernel on Z. Then, 

(2.11) p{z, z') = k{z, z) + k{z', z') - 2k{z, z') 

defines a valid semimetric p of negative type on Z. 

Definition 13. (Equivalent kernels) Whenever the kernel k and 
semimetric p satisfy (2.11), we will say that k generates p. If two kernels 
generate the same semimetric, we will say that they are equivalent kernels. 

It is clear that every distance kernel $k \in \mathcal{K}_\rho$ induced by
$\rho$ also generates $\rho$. However, there are many other kernels that
generate $\rho$, as illustrated in the following example.

Example 14. Let $\mathcal{Z} = \mathbb{R}^d$, and consider the Gaussian
kernel $k(z, z') = e^{-\sigma \|z - z'\|^2}$. The induced semimetric is
$\rho(z, z') = 2 \left[ 1 - e^{-\sigma \|z - z'\|^2} \right]$. The Gaussian
kernel is equivalent to

$\tilde{k}(z, z') = e^{-\sigma \|z - z'\|^2} + 1 - e^{-\sigma \|z\|^2} - e^{-\sigma \|z'\|^2}.$

Note that $\tilde{k} \in \mathcal{K}_\rho$, while $k \notin \mathcal{K}_\rho$.
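
The equivalence of two kernels in the sense of Definition 13 is easy to check on a finite sample: both Gram matrices must produce the same semimetric through (2.11). The sketch below does this for a Gaussian kernel and the distance kernel centred at the origin that it induces. It assumes the parametrisation $e^{-\sigma \|z - z'\|^2}$ for the Gaussian kernel, and the function names are hypothetical.

```python
import numpy as np


def sq_dists(a, b):
    """Pairwise squared Euclidean distances ||a_i - b_j||^2."""
    return np.sum((a[:, None, :] - b[None, :, :]) ** 2, axis=2)


def gaussian(a, b, sigma=1.0):
    """Gaussian kernel exp(-sigma ||a_i - b_j||^2); the parametrisation is an assumption."""
    return np.exp(-sigma * sq_dists(a, b))


def gaussian_distance_kernel(a, b, sigma=1.0):
    """Distance kernel centred at the origin, induced by the Gaussian semimetric."""
    origin = np.zeros((1, a.shape[1]))
    return (gaussian(a, b, sigma) + 1.0
            - gaussian(a, origin, sigma)
            - gaussian(b, origin, sigma).T)


def generated_semimetric(gram):
    """Semimetric (2.11) generated by a kernel, evaluated on the sample points."""
    d = np.diag(gram)
    return d[:, None] + d[None, :] - 2.0 * gram
```

Applying `generated_semimetric` to either Gram matrix on the same sample should return the same matrix, with entries $2[1 - e^{-\sigma \|z_i - z_j\|^2}]$.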

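Proposition 15. Let $k$ and $\tilde{k}$ be two kernels on $\mathcal{Z}$.
$k$ and $\tilde{k}$ are equivalent if and only if
$\tilde{k}(z, z') = k(z, z') + f(z) + f(z')$, for some shift function
$f : \mathcal{Z} \to \mathbb{R}$.

Proof. Let $\tilde{k}(z, z') = k(z, z') + f(z) + f(z')$. Then,

$\tilde{k}(z, z) + \tilde{k}(z', z') - 2\tilde{k}(z, z') = k(z, z) + 2f(z) + k(z', z') + 2f(z')$

$\quad - 2 \left[ k(z, z') + f(z) + f(z') \right]$

$= k(z, z) + k(z', z') - 2k(z, z').$

The converse is clear from writing $h = \tilde{k} - k$, whereby if $k$ and
$\tilde{k}$ are equivalent, it follows that
$h(z, z') = \frac{1}{2} \left( h(z, z) + h(z', z') \right)$, for all
$z, z' \in \mathcal{Z}$, so that $f(z) = \frac{1}{2} h(z, z)$ is the required
shift function. □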

Not every choice of shift function $f$ in Proposition 15 will be valid, as
both $k$ and $\tilde{k}$ are required to be positive definite. An important
class of shift functions can be derived using RKHS functions, however.
Namely, let $k$ be a kernel on $\mathcal{Z}$ and let $f \in \mathcal{H}_k$,
and define a kernel

$k_f(z, z') = \langle k(\cdot, z) - f, k(\cdot, z') - f \rangle_{\mathcal{H}_k}$

$= k(z, z') - f(z) - f(z') + \|f\|_{\mathcal{H}_k}^2.$

Since it is representable as an inner product in a Hilbert space, $k_f$ is a
valid kernel which is equivalent to $k$ by Proposition 15. As a special case,
if $f = \mu_k(P)$ for some $P \in \mathcal{M}_+^1(\mathcal{Z})$, we obtain the
kernel centred at probability measure $P$:

(2.12) $k_P(z, z') := k(z, z') + \mathbb{E}_{WW'} k(W, W') - \mathbb{E}_W k(z, W) - \mathbb{E}_W k(z', W),$

with $W, W' \overset{i.i.d.}{\sim} P$. Note that
$\mathbb{E}_{ZZ' \sim P}\, k_P(Z, Z') = 0$, i.e., $\mu_{k_P}(P) = 0$.
The kernels of form (2.12) that are centred at the point masses
$P = \delta_{z_0}$ are precisely the distance kernels equivalent to $k$.
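
When $P$ is taken to be the empirical measure of a sample, the centred kernel (2.12) evaluated on that sample is obtained by subtracting row and column means of the Gram matrix and adding back its overall mean. The sketch below illustrates this; the function name `centre_at_empirical` is a hypothetical name, and the use of the empirical measure in place of $P$ is an assumption of the example.

```python
import numpy as np


def centre_at_empirical(gram):
    """Gram matrix of k_P in (2.12), with P the empirical measure of the sample.

    gram is the symmetric n x n Gram matrix of k on the sample points.
    """
    row = gram.mean(axis=1, keepdims=True)   # E_W k(z_i, W)
    col = gram.mean(axis=0, keepdims=True)   # E_W k(z_j, W)
    return gram + gram.mean() - row - col    # gram.mean() is E_WW' k(W, W')
```

The centred Gram matrix averages to zero, mirroring $\mu_{k_P}(P) = 0$, and it generates the same semimetric on the sample as the original kernel, as Proposition 15 predicts.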

The relationship between positive definite kernels and semimetrics of neg- 
ative type is illustrated in Figure 1. 



EQUIVALENCE OF DISTANCE-BASED AND RKHS-BASED STATISTICS 11 



Figure 1. The relationship between kernels and semimetrics (diagram labels: "equivalent kernels"; "the set of PD kernels on $\mathcal{Z}$").

Remark 16. The requirement that kernels be characteristic (as intro- 
duced below Definition 3) is clearly important in hypothesis testing. A sec- 
ond family of kernels, widely used in the machine learning literature, are 
the universal kernels: universality can be used to guarantee consistency of 
learning algorithms [Steinwart and Christmann, 2008]. A relation between 
universal and characteristic kernels is described in Appendix B. 

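2.6. Existence of kernel embedding through a semimetric. In Section 2.1,
we have seen that a sufficient condition for the kernel embedding
$\mu_k(\nu)$ of $\nu \in \mathcal{M}(\mathcal{Z})$ to exist is that
$\nu \in \mathcal{M}_k^{1/2}(\mathcal{Z})$. We will now interpret this
condition in terms of the semimetric $\rho$ generated by $k$.

Definition 17. For $\theta > 0$, we say that $\nu \in \mathcal{M}(\mathcal{Z})$
has a finite $\theta$-moment with respect to a semimetric $\rho$ of negative
type if there exists $z_0 \in \mathcal{Z}$ such that
$\int \rho^{\theta}(z, z_0)\, d|\nu|(z) < \infty$. We denote

(2.13) $\mathcal{M}_\rho^{\theta}(\mathcal{Z}) = \left\{ \nu \in \mathcal{M}(\mathcal{Z}) : \exists z_0 \in \mathcal{Z} \ \text{s.t.} \ \int \rho^{\theta}(z, z_0)\, d|\nu|(z) < \infty \right\}.$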
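We now relate the space of measures with finite $\theta$-moment with respect
to $\rho$ with the space $\mathcal{M}_k^{\theta}(\mathcal{Z})$.

Proposition 18. Let $k$ be a kernel that generates semimetric $\rho$, and let
$n \in \mathbb{N}$. Then $\mathcal{M}_\rho^{n/2}(\mathcal{Z}) = \mathcal{M}_k^{n/2}(\mathcal{Z})$.
In particular, if $k_1$ and $k_2$ generate the same semimetric $\rho$, then
$\mathcal{M}_{k_1}^{n/2}(\mathcal{Z}) = \mathcal{M}_{k_2}^{n/2}(\mathcal{Z})$.

Proof. Let $\theta \ge \frac{1}{2}$. Suppose $\nu \in \mathcal{M}_k^{\theta}(\mathcal{Z})$.
Then we have

$\int \rho^{\theta}(z, z_0)\, d|\nu|(z) = \int \| k(\cdot, z) - k(\cdot, z_0) \|_{\mathcal{H}_k}^{2\theta}\, d|\nu|(z)$

$\le \int \left( \| k(\cdot, z) \|_{\mathcal{H}_k} + \| k(\cdot, z_0) \|_{\mathcal{H}_k} \right)^{2\theta} d|\nu|(z)$

$\le 2^{2\theta - 1} \left( \int \| k(\cdot, z) \|_{\mathcal{H}_k}^{2\theta}\, d|\nu|(z) + \int \| k(\cdot, z_0) \|_{\mathcal{H}_k}^{2\theta}\, d|\nu|(z) \right)$

$= 2^{2\theta - 1} \left( \int k^{\theta}(z, z)\, d|\nu|(z) + k^{\theta}(z_0, z_0)\, |\nu|(\mathcal{Z}) \right)$

$< \infty,$

where we have used that $a^{2\theta}$ is a convex function of $a$. From the
above it is clear that $\mathcal{M}_k^{\theta}(\mathcal{Z}) \subseteq \mathcal{M}_\rho^{\theta}(\mathcal{Z})$
for $\theta \ge 1/2$.

To prove the other direction, we show by induction that
$\mathcal{M}_\rho^{\theta}(\mathcal{Z}) \subseteq \mathcal{M}_k^{n/2}(\mathcal{Z})$
for $\theta \ge \frac{n}{2}$, $n \in \mathbb{N}$. Let $n = 1$. Let
$\theta \ge \frac{1}{2}$, and suppose that $\nu \in \mathcal{M}_\rho^{\theta}(\mathcal{Z})$.
Then, by invoking the reverse triangle and Jensen's inequalities, we have:

$\int \rho^{\theta}(z, z_0)\, d|\nu|(z) = \int \| k(\cdot, z) - k(\cdot, z_0) \|_{\mathcal{H}_k}^{2\theta}\, d|\nu|(z)$

$\ge \int \left| k^{1/2}(z, z) - k^{1/2}(z_0, z_0) \right|^{2\theta} d|\nu|(z)$

$\ge \left| \int k^{1/2}(z, z)\, d|\nu|(z) - |\nu|(\mathcal{Z})\, k^{1/2}(z_0, z_0) \right|^{2\theta},$

which implies $\nu \in \mathcal{M}_k^{1/2}(\mathcal{Z})$, thereby satisfying
the result for $n = 1$. Suppose the result holds for $\theta \ge \frac{n-1}{2}$,
i.e., $\mathcal{M}_\rho^{\theta}(\mathcal{Z}) \subseteq \mathcal{M}_k^{(n-1)/2}(\mathcal{Z})$
for $\theta \ge \frac{n-1}{2}$. Let $\nu \in \mathcal{M}_\rho^{\theta}(\mathcal{Z})$
for $\theta \ge \frac{n}{2}$. Then we have

$\int \rho^{\theta}(z, z_0)\, d|\nu|(z) = \int \left( \| k(\cdot, z) - k(\cdot, z_0) \|_{\mathcal{H}_k}^{n} \right)^{2\theta/n} d|\nu|(z)$

$\ge \left( \int \| k(\cdot, z) - k(\cdot, z_0) \|_{\mathcal{H}_k}^{n}\, d|\nu|(z) \right)^{2\theta/n}$

$\ge \left( \int \left| \| k(\cdot, z) \|_{\mathcal{H}_k} - \| k(\cdot, z_0) \|_{\mathcal{H}_k} \right|^{n} d|\nu|(z) \right)^{2\theta/n}$

$\ge \left| \int \left( \| k(\cdot, z) \|_{\mathcal{H}_k} - \| k(\cdot, z_0) \|_{\mathcal{H}_k} \right)^{n} d|\nu|(z) \right|^{2\theta/n}$

$= \left| \int \sum_{r=0}^{n} (-1)^r \binom{n}{r} \| k(\cdot, z) \|_{\mathcal{H}_k}^{n-r} \| k(\cdot, z_0) \|_{\mathcal{H}_k}^{r}\, d|\nu|(z) \right|^{2\theta/n}$

$= \left| A + B \right|^{2\theta/n},$

where

$A = \int k^{n/2}(z, z)\, d|\nu|(z), \qquad B = \sum_{r=1}^{n} (-1)^r \binom{n}{r} k^{r/2}(z_0, z_0) \int k^{(n-r)/2}(z, z)\, d|\nu|(z).$

Note that the terms in $B$ are finite since for
$\theta \ge \frac{n}{2} \ge \frac{n-1}{2} \ge \cdots \ge \frac{1}{2}$, we have
$\mathcal{M}_\rho^{\theta}(\mathcal{Z}) \subseteq \mathcal{M}_k^{(n-1)/2}(\mathcal{Z}) \subseteq \cdots \subseteq \mathcal{M}_k^{1}(\mathcal{Z}) \subseteq \mathcal{M}_k^{1/2}(\mathcal{Z})$,
and therefore $A$ is finite, which means $\nu \in \mathcal{M}_k^{n/2}(\mathcal{Z})$,
i.e., $\mathcal{M}_\rho^{\theta}(\mathcal{Z}) \subseteq \mathcal{M}_k^{n/2}(\mathcal{Z})$
for $\theta \ge \frac{n}{2}$.

The result shows that $\mathcal{M}_\rho^{\theta}(\mathcal{Z}) = \mathcal{M}_k^{\theta}(\mathcal{Z})$
for all $\theta \in \{ \frac{n}{2} : n \in \mathbb{N} \}$. □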

The above Proposition gives a natural interpretation of conditions on
probability measures in terms of moments w.r.t. $\rho$. Namely, the kernel
embedding $\mu_k(P)$, where kernel $k$ generates the semimetric $\rho$, exists
for every $P$ with finite half-moment w.r.t. $\rho$, and thus the MMD
$\gamma_k(P, Q)$ between $P$ and $Q$ is well defined whenever both $P$ and $Q$
have finite half-moments w.r.t. $\rho$. Furthermore, HSIC between random
variables $X$ and $Y$ is well defined whenever their marginals $P_X$ and $P_Y$
have finite first moments w.r.t. semimetrics $\rho_{\mathcal{X}}$ and
$\rho_{\mathcal{Y}}$ generated by kernels $k_{\mathcal{X}}$ and
$k_{\mathcal{Y}}$ on their respective domains $\mathcal{X}$ and $\mathcal{Y}$.

3. Equivalence of MMD and energy distance. In this section, we 
begin with a review of the energy distance, which measures distance be- 
tween distributions, and demonstrate that it is an instance of the maximum 
mean discrepancy introduced in Definition 3. The energy distance was
introduced by Szekely and Rizzo [2004, 2005] as a measure of statistical
distance between two probability measures $P$ and $Q$ on $\mathbb{R}^d$ with
finite first moments,

(3.1) $D_E(P, Q) = 2\, \mathbb{E}_{ZW} \|Z - W\| - \mathbb{E}_{ZZ'} \|Z - Z'\| - \mathbb{E}_{WW'} \|W - W'\|,$

where $Z, Z' \overset{i.i.d.}{\sim} P$ and $W, W' \overset{i.i.d.}{\sim} Q$.
This quantity is always nonnegative, and is strictly positive if $P \neq Q$.
In the scalar case, it coincides with twice the Cramér-von Mises distance.

Following Lyons [2011], the notion is readily generalized to a semimetric 
space of negative type. 

Definition 19. Let $(\mathcal{Z}, \rho)$ be a semimetric space of negative
type, and let $P, Q \in \mathcal{M}_+^1(\mathcal{Z}) \cap \mathcal{M}_\rho^1(\mathcal{Z})$.
The energy distance between $P$ and $Q$, w.r.t. $\rho$, is

(3.2) $D_{E,\rho}(P, Q) = 2\, \mathbb{E}_{ZW} \rho(Z, W) - \mathbb{E}_{ZZ'} \rho(Z, Z') - \mathbb{E}_{WW'} \rho(W, W'),$

where $Z, Z' \overset{i.i.d.}{\sim} P$ and $W, W' \overset{i.i.d.}{\sim} Q$.

The moment condition is required in order to ensure that each of the
expectations in (3.2) is finite. Note that the energy distance can
equivalently be represented in the integral form,

(3.3) $D_{E,\rho}(P, Q) = - \int\!\!\int \rho \, d([P - Q] \times [P - Q]),$

whereby the negative type of $\rho$ implies the non-negativity of $D_{E,\rho}$.

We are now able to show that for every $\rho$, $D_{E,\rho}$ is related to
the MMD associated to a kernel $k$ that generates $\rho$.

Theorem 20. Let $(\mathcal{Z}, \rho)$ be a semimetric space of negative type
and let $k$ be any kernel that generates $\rho$. Then

$D_{E,\rho}(P, Q) = 2 \gamma_k^2(P, Q), \quad \forall P, Q \in \mathcal{M}_+^1(\mathcal{Z}) \cap \mathcal{M}_\rho^1(\mathcal{Z}).$

In particular, equivalent kernels have the same maximum mean discrepancy.
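
Since Theorem 20 applies in particular to empirical measures, the identity can be checked exactly (up to rounding) on any pair of samples: the plug-in energy distance should equal twice the plug-in squared MMD computed with a distance kernel, whatever the centre $z_0$. The sketch below is such a check for $\rho(z, z') = \|z - z'\|^q$; the function names and the parameters `q` and `z0` are hypothetical choices made for the illustration.

```python
import numpy as np


def pairwise_rho(a, b, q=1.0):
    """Pairwise semimetric ||a_i - b_j||^q."""
    return np.linalg.norm(a[:, None, :] - b[None, :, :], axis=2) ** q


def energy_distance(z, w, q=1.0):
    """Plug-in estimate of the energy distance (3.2)."""
    return (2.0 * pairwise_rho(z, w, q).mean()
            - pairwise_rho(z, z, q).mean()
            - pairwise_rho(w, w, q).mean())


def mmd2_distance_kernel(z, w, q=1.0, z0=None):
    """Plug-in squared MMD (2.3) using the distance kernel (2.9) centred at z0."""
    d = z.shape[1]
    z0 = np.zeros((1, d)) if z0 is None else np.asarray(z0, dtype=float).reshape(1, d)

    def k(a, b):
        return 0.5 * (pairwise_rho(a, z0, q) + pairwise_rho(b, z0, q).T
                      - pairwise_rho(a, b, q))

    return k(z, z).mean() + k(w, w).mean() - 2.0 * k(z, w).mean()
```

Changing `z0` changes the kernel but not the value returned by `mmd2_distance_kernel`, which is the sample analogue of the statement that equivalent kernels have the same MMD.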

Proof. Since $k$ generates $\rho$, we can write
$\rho(z, w) = k(z, z) + k(w, w) - 2k(z, w)$. Denote $\nu = P - Q$. Then,

$D_{E,\rho}(P, Q) = - \int\!\!\int \rho(z, w)\, d\nu(z)\, d\nu(w)$

$= - \int\!\!\int \left[ k(z, z) + k(w, w) - 2k(z, w) \right] d\nu(z)\, d\nu(w)$

$= 2 \int\!\!\int k(z, w)\, d\nu(z)\, d\nu(w)$

$= 2 \gamma_k^2(P, Q),$

where we used the fact that $\nu(\mathcal{Z}) = 0$. □

In its basic form, this result also appears in [Lyons, 2011, p. 11, Eq. (3.9)],
but since Lyons works with embeddings into general Hilbert spaces, the link
to the RKHS-based statistics and MMD in particular is obscured. Theorem 20
shows that all kernels that generate the same semimetric $\rho$ on
$\mathcal{Z}$ give rise to the same metric $\gamma_k$ on (possibly a subset
of) $\mathcal{M}_+^1(\mathcal{Z})$, whence $\gamma_k$ is merely an extension of
the metric induced by $\rho^{1/2}$ on the point masses, since

$\gamma_k(\delta_z, \delta_{z'}) = \| k(\cdot, z) - k(\cdot, z') \|_{\mathcal{H}_k} = \rho^{1/2}(z, z').$

In other words, whenever kernel $k$ generates $\rho$, $z \mapsto \delta_z$ is
an isometry between $(\mathcal{Z}, \rho^{1/2})$ and
$\{ \delta_z : z \in \mathcal{Z} \} \subseteq \mathcal{M}_+^1(\mathcal{Z})$,
endowed with the MMD metric $\gamma_k = (\frac{1}{2} D_{E,\rho})^{1/2}$; and
the Aronszajn map $z \mapsto k(\cdot, z)$ is an isometric embedding of a
metric space $(\mathcal{Z}, \rho^{1/2})$ into $\mathcal{H}_k$. These
isometries are depicted in Figure 2. For simplicity, we show the case of a
bounded kernel, where kernel embeddings are well defined for all
$P \in \mathcal{M}_+^1(\mathcal{Z})$, in which case
$(\mathcal{M}_+^1(\mathcal{Z}), \gamma_k)$ and
$\mu_k(\mathcal{M}_+^1(\mathcal{Z})) = \{ \mu_k(P) : P \in \mathcal{M}_+^1(\mathcal{Z}) \}$
endowed with the Hilbert-space metric inherited from $\mathcal{H}_k$ are also
isometric (note that this implies that the subsets of RKHSs corresponding to
equivalent kernels are also isometric).

Remark 21. Theorem 20 requires that $P, Q \in \mathcal{M}_\rho^1(\mathcal{Z})$,
i.e., that $P$ and $Q$ have finite first moments w.r.t. $\rho$, as otherwise
the energy distance between $P$ and $Q$ may be undefined; e.g., each of the
expectations $\mathbb{E}_{ZZ'} \rho(Z, Z')$, $\mathbb{E}_{WW'} \rho(W, W')$ and
$\mathbb{E}_{ZW} \rho(Z, W)$ may be infinite. However, as long as a weaker
condition $P, Q \in \mathcal{M}_\rho^{1/2}(\mathcal{Z})$ is satisfied, i.e.,
$P$ and $Q$ have finite half-moments w.r.t. $\rho$, the maximum mean
discrepancy $\gamma_k$ will be well defined. If, in addition,
$P, Q \in \mathcal{M}_\rho^1(\mathcal{Z})$, then the energy distance between
$P$ and $Q$ is also well defined, and must be equal to $2\gamma_k^2$. We will
later invoke the same condition $P, Q \in \mathcal{M}_k^1(\mathcal{Z})$ when
describing the asymptotic distribution of the empirical maximum mean
discrepancy in Section 6.

4. Distance covariance through kernel embeddings. A related 
notion to the energy distance is that of distance covariance, which measures 
dependence between random variables. In this section, we review distance 
covariance and show that it is an instance of the Hilbert-Schmidt indepen- 
dence criterion. 
Figure 2. Isometries relating the semimetric $\rho$ on $\mathcal{Z}$ with the RKHS corresponding to a
kernel $k$ that generates $\rho$, and with the set of probability measures on $\mathcal{Z}$.

4.1. Distance covariance. Let $X$ be a random vector on $\mathbb{R}^p$ and
$Y$ a random vector on $\mathbb{R}^q$. The distance covariance was introduced
by Szekely et al. [2007], Szekely and Rizzo [2009] to address the problem of
testing and measuring dependence between $X$ and $Y$ in terms of a weighted
$L_2$-distance between characteristic functions of the joint distribution of
$X$ and $Y$ and the product of their marginals. As a particular choice of
weight function is used (we discuss this further in Section 4.3), it can be
computed in terms of certain expectations of pairwise Euclidean distances,

(4.1) $\mathcal{V}^2(X, Y) = \mathbb{E}_{XY} \mathbb{E}_{X'Y'} \|X - X'\| \|Y - Y'\|$

$\quad + \mathbb{E}_X \mathbb{E}_{X'} \|X - X'\|\, \mathbb{E}_Y \mathbb{E}_{Y'} \|Y - Y'\|$

$\quad - 2\, \mathbb{E}_{XY} \left[ \mathbb{E}_{X'} \|X - X'\|\, \mathbb{E}_{Y'} \|Y - Y'\| \right],$

where $(X, Y)$ and $(X', Y')$ are i.i.d. $\sim P_{XY}$. As in the case of the
energy distance, Lyons [2011] established that the distance covariance can be
generalized to metric spaces of negative type.

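Definition 22. Let $(\mathcal{X}, \rho_{\mathcal{X}})$ and
$(\mathcal{Y}, \rho_{\mathcal{Y}})$ be semimetric spaces of negative type, and
let $X \sim P_X \in \mathcal{M}_{\rho_{\mathcal{X}}}^2(\mathcal{X})$ and
$Y \sim P_Y \in \mathcal{M}_{\rho_{\mathcal{Y}}}^2(\mathcal{Y})$, having joint
distribution $P_{XY}$. The generalized distance covariance of $X$ and $Y$ is

(4.2) $\mathcal{V}^2_{\rho_{\mathcal{X}}, \rho_{\mathcal{Y}}}(X, Y) = \mathbb{E}_{XY} \mathbb{E}_{X'Y'} \rho_{\mathcal{X}}(X, X')\, \rho_{\mathcal{Y}}(Y, Y')$

$\quad + \mathbb{E}_X \mathbb{E}_{X'} \rho_{\mathcal{X}}(X, X')\, \mathbb{E}_Y \mathbb{E}_{Y'} \rho_{\mathcal{Y}}(Y, Y')$

$\quad - 2\, \mathbb{E}_{XY} \left[ \mathbb{E}_{X'} \rho_{\mathcal{X}}(X, X')\, \mathbb{E}_{Y'} \rho_{\mathcal{Y}}(Y, Y') \right].$

As with the MMD, the moment condition ensures that the expectations are
finite (which can be seen using the Cauchy-Schwarz inequality). Equivalently,
the generalized distance covariance can be represented in integral form,

(4.3) $\mathcal{V}^2_{\rho_{\mathcal{X}}, \rho_{\mathcal{Y}}}(X, Y) = \int \rho_{\mathcal{X}} \rho_{\mathcal{Y}} \, d([P_{XY} - P_X P_Y] \times [P_{XY} - P_X P_Y]),$

where $\rho_{\mathcal{X}} \rho_{\mathcal{Y}}$ is viewed as a function on
$(\mathcal{X} \times \mathcal{Y}) \times (\mathcal{X} \times \mathcal{Y})$.
Furthermore, Lyons [2011, Theorem 3.20] shows that distance covariance in a
metric space characterizes independence (i.e.,
$\mathcal{V}^2_{\rho_{\mathcal{X}}, \rho_{\mathcal{Y}}}(X, Y) = 0$ if and only
if $X$ and $Y$ are independent) if the metrics $\rho_{\mathcal{X}}$ and
$\rho_{\mathcal{Y}}$ satisfy an additional property, termed strong negative
type. The discussion of this property is relegated to Section 5.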

4.2. Equivalence between HSIC and distance covariance. The following 
theorem demonstrates the link between HSIC and the distance covariance. 

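Theorem 23. Let $(\mathcal{X}, \rho_{\mathcal{X}})$ and
$(\mathcal{Y}, \rho_{\mathcal{Y}})$ be semimetric spaces of negative type, and
let $X \sim P_X \in \mathcal{M}_{\rho_{\mathcal{X}}}^2(\mathcal{X})$ and
$Y \sim P_Y \in \mathcal{M}_{\rho_{\mathcal{Y}}}^2(\mathcal{Y})$, having joint
distribution $P_{XY}$. Let $k_{\mathcal{X}}$ and $k_{\mathcal{Y}}$ be any two
kernels on $\mathcal{X}$ and $\mathcal{Y}$ that generate $\rho_{\mathcal{X}}$
and $\rho_{\mathcal{Y}}$, respectively, and denote

(4.4) $k((x, y), (x', y')) = k_{\mathcal{X}}(x, x')\, k_{\mathcal{Y}}(y, y').$

Then, $\mathcal{V}^2_{\rho_{\mathcal{X}}, \rho_{\mathcal{Y}}}(X, Y) = 4 \gamma_k^2(P_{XY}, P_X P_Y)$.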
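
The identity in Theorem 23 also holds for empirical measures, so it can be verified numerically on a paired sample: the plug-in distance covariance (4.2) should equal four times the plug-in HSIC (2.6) computed with distance kernels, for any choice of centres. The sketch below uses Euclidean distances and centres the distance kernels at sample points; the function names and the index arguments are hypothetical and serve only this illustration.

```python
import numpy as np


def pairwise_dist(a):
    """Matrix of Euclidean distances ||a_i - a_j|| within one sample."""
    return np.linalg.norm(a[:, None, :] - a[None, :, :], axis=2)


def product_vstat(a, b):
    """Plug-in estimate of the common three-term form of (2.6) and (4.2).

    a and b are symmetric n x n matrices (distances or Gram matrices)
    computed on a paired sample.
    """
    return (np.mean(a * b)
            + a.mean() * b.mean()
            - 2.0 * np.mean(a.mean(axis=1) * b.mean(axis=1)))


def distance_gram(d, i0=0):
    """Gram matrix of the distance kernel (2.9) centred at sample point i0.

    d is the symmetric matrix of pairwise semimetric values on the sample.
    """
    return 0.5 * (d[:, [i0]] + d[[i0], :] - d)
```

Calling `product_vstat` on the two distance matrices gives the distance covariance estimate, and calling it on the two matrices returned by `distance_gram` gives the HSIC estimate; the centres `i0` may differ between the two samples.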

Proof. Put $\nu = P_{XY} - P_X P_Y$. Then

$\mathcal{V}^2_{\rho_{\mathcal{X}}, \rho_{\mathcal{Y}}}(X, Y) = \int\!\!\int \rho_{\mathcal{X}}(x, x')\, \rho_{\mathcal{Y}}(y, y')\, d\nu(x, y)\, d\nu(x', y')$

$= \int\!\!\int \left( k_{\mathcal{X}}(x, x) + k_{\mathcal{X}}(x', x') - 2k_{\mathcal{X}}(x, x') \right) \left( k_{\mathcal{Y}}(y, y) + k_{\mathcal{Y}}(y', y') - 2k_{\mathcal{Y}}(y, y') \right) d\nu(x, y)\, d\nu(x', y')$

$= 4 \int\!\!\int k_{\mathcal{X}}(x, x')\, k_{\mathcal{Y}}(y, y')\, d\nu(x, y)\, d\nu(x', y')$

$= 4 \gamma_k^2(P_{XY}, P_X P_Y),$

where we used that $\nu(\mathcal{X} \times \mathcal{Y}) = 0$, and that
$\int g(x, y, x', y')\, d\nu(x, y)\, d\nu(x', y') = 0$ when $g$ does not depend
on one or more of its arguments, since $\nu$ also has zero marginal measures.
Convergence of integrals of the form
$\int k_{\mathcal{X}}(x, x)\, k_{\mathcal{Y}}(y, y)\, d\nu(x, y)$ is ensured by
the moment conditions on the marginals. □

We remark that a similar result to Theorem 23 is given by Lyons [2011,
Proposition 3.16], but without making use of the link with kernel
embeddings. Theorem 23 is a more general statement, in the sense that we
allow $\rho$ to be a semimetric of negative type, rather than a metric. In
addition, the kernel interpretation leads to a significantly simpler proof:
the result is an immediate application of the HSIC expansion in (2.6).

Remark 24. As in Remark 21, to ensure the existence of the distance
covariance, we impose a stronger condition on the marginals:
$P_X \in \mathcal{M}_{k_{\mathcal{X}}}^2(\mathcal{X})$ and
$P_Y \in \mathcal{M}_{k_{\mathcal{Y}}}^2(\mathcal{Y})$, while
$P_X \in \mathcal{M}_{k_{\mathcal{X}}}^1(\mathcal{X})$ and
$P_Y \in \mathcal{M}_{k_{\mathcal{Y}}}^1(\mathcal{Y})$ are sufficient for the
existence of the Hilbert-Schmidt independence criterion.

Remark 25. As introduced by Szekely et al. [2007], the notion of distance
covariance extends naturally to that of distance variance
$\mathcal{V}^2(X) = \mathcal{V}^2(X, X)$ and of distance correlation (by
analogy with the Pearson product-moment correlation coefficient),

$\mathcal{R}^2(X, Y) = \begin{cases} \dfrac{\mathcal{V}^2(X, Y)}{\mathcal{V}(X)\, \mathcal{V}(Y)}, & \mathcal{V}(X)\, \mathcal{V}(Y) > 0, \\ 0, & \mathcal{V}(X)\, \mathcal{V}(Y) = 0. \end{cases}$

The distance correlation can also be expressed in terms of associated kernels;
see Appendix A for details.

4.3. Characteristic function interpretation. The distance covariance in
(4.1) was defined by Szekely et al. [2007] in terms of a weighted distance
between characteristic functions. We briefly review this interpretation here;
however, we show that this approach cannot be used to derive a kernel-based
measure of dependence (this result was first noted by Gretton et al. [2009b],
and is included here in the interest of completeness). Let $X$ be a random
vector on $\mathcal{X} = \mathbb{R}^p$ and $Y$ a random vector on
$\mathcal{Y} = \mathbb{R}^q$. The characteristic functions of $X$ and $Y$ will
be denoted by $f_X$ and $f_Y$, respectively, and their joint characteristic
function by $f_{XY}$. The distance covariance $\mathcal{V}(X, Y)$ is defined
via the norm of $f_{XY} - f_X f_Y$ in a weighted $L_2$ space on
$\mathbb{R}^{p+q}$, i.e.,

(4.5) $\mathcal{V}^2(X, Y) = \int_{\mathbb{R}^{p+q}} \left| f_{XY}(t, s) - f_X(t) f_Y(s) \right|^2 w(t, s)\, dt\, ds,$
for a particular choice of weight function given by

(4.6) $w(t,s) = \dfrac{1}{c_p c_q} \cdot \dfrac{1}{\|t\|^{1+p} \|s\|^{1+q}}$,

where $c_d = \pi^{(1+d)/2} / \Gamma((1+d)/2)$, $d \ge 1$. An important aspect of
distance covariance is that $\mathcal{V}(X,Y) = 0$ if and only if X and Y are
independent. We next obtain a similar statistic in the kernel setting. Write
$\mathcal{Z} = \mathcal{X} \times \mathcal{Y}$, and let $k(z,z') = \kappa(z - z')$
be a translation invariant RKHS kernel on $\mathcal{Z}$, where
$\kappa: \mathcal{Z} \to \mathbb{R}$ is a bounded continuous function. Using
Bochner's theorem, $\kappa$ can be written as:

$\kappa(z) = \int e^{-i z^\top u} \, d\Lambda(u)$,

for a finite non-negative Borel measure $\Lambda$. It follows [Gretton et al., 2009b]
that

$\gamma_k^2(P_{XY}, P_XP_Y) = \int_{\mathbb{R}^{p+q}} \left| f_{XY}(t,s) - f_X(t) f_Y(s) \right|^2 d\Lambda(t,s)$,

which is in clear correspondence with (4.5). However, the weight function in
(4.6) is not integrable, so we cannot find a continuous translation invariant
kernel for which $\gamma_k$ coincides with the distance covariance. By contrast,
note that the kernel in (4.4) is not translation invariant.

5. Distinguishing probability distributions. Theorem 3.20 of Lyons
[2011] shows that distance covariance in a metric space characterizes
independence if the metrics satisfy an additional property, termed strong
negative type. We will extend this notion to a semimetric $\rho$, establish the
interpretation of strong negative type in terms of kernels, and show how the
strong negative type of a semimetric $\rho$ can be established by considering
whether the kernel k that generates $\rho$ is characteristic.

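Definition 26. The semimetric space $(\mathcal{Z}, \rho)$, where $\rho$ is generated
by kernel k, is said to have a strong negative type if
$\forall P, Q \in \mathcal{M}^1_+(\mathcal{Z}) \cap \mathcal{M}^1_k(\mathcal{Z})$,

(5.1) $P \neq Q \;\Rightarrow\; \int \rho \, d\left([P - Q] \times [P - Q]\right) < 0$.

Since the quantity in (5.1) is exactly $-2\gamma_k^2(P,Q)$,
$\forall P, Q \in \mathcal{M}^1_+(\mathcal{Z}) \cap \mathcal{M}^1_k(\mathcal{Z})$,
we directly obtain:

Proposition 27. Let kernel k generate $\rho$. Then $(\mathcal{Z}, \rho)$ has a
strong negative type if and only if k is characteristic to
$\mathcal{M}^1_+(\mathcal{Z}) \cap \mathcal{M}^1_k(\mathcal{Z})$.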

Thus, the problems of checking whether a semimetric is of strong negative
type and whether its associated kernel is characteristic to an appropriate
space of Borel probability measures are equivalent. This conclusion has some
overlap with [Lyons, 2011]: in particular, Proposition 27 is stated in [Lyons,
2011, Proposition 3.10], where the barycenter map $\beta$ is a kernel embedding in
our terminology, although Lyons does not consider distribution embeddings
in an RKHS.

Remark 28. From [Lyons, 2011, Theorem 3.25], every separable Hilbert 
space Z is of strong negative type, so a distance kernel k induced by the 
(inner product) metric on Z is characteristic to the appropriate space of 
probability measures. 

Remark 29. Consider the kernel in (4.4), and let us, for simplicity, assume
that $k_\mathcal{X}$ and $k_\mathcal{Y}$ are bounded, so that we can consider
embeddings of all probability measures. It turns out that k need not be
characteristic, i.e., it may not be able to distinguish between any two
distributions on $\mathcal{X} \times \mathcal{Y}$, even if $k_\mathcal{X}$ and
$k_\mathcal{Y}$ are characteristic. Namely, if $k_\mathcal{X}$ is the distance
kernel induced by $\rho_\mathcal{X}$ and centred at $x_0$, then
$k((x_0,y),(x_0,y')) = 0$ for all $y, y' \in \mathcal{Y}$. That means that for
every two distinct $P_Y, Q_Y \in \mathcal{M}^1_+(\mathcal{Y})$, we have
$\gamma_k(\delta_{x_0}P_Y, \delta_{x_0}Q_Y) = 0$. Thus, given that
$\rho_\mathcal{X}$ and $\rho_\mathcal{Y}$ have strong negative type, the kernel
in (4.4) characterizes independence, but not equality of probability measures
on the product space. Informally speaking, distinguishing $P_{XY}$ from
$P_XP_Y$ is an easier problem than two-sample testing on the product space.
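
The degenerate case in Remark 29 can be reproduced numerically. The following
is a minimal sketch rather than part of the original analysis: it assumes
real-valued X and Y, the Euclidean distance as the semimetric, and the distance
kernel $k(a,b) = \frac{1}{2}[\rho(a,c) + \rho(b,c) - \rho(a,b)]$ centred at a
point c; all function names are hypothetical.

```python
import numpy as np

def distance_gram_1d(a, b, centre):
    """k(a_i, b_j) = 0.5 * (|a_i - c| + |b_j - c| - |a_i - b_j|) on the real line."""
    return 0.5 * (np.abs(a - centre)[:, None] + np.abs(b - centre)[None, :]
                  - np.abs(a[:, None] - b[None, :]))

def product_mmd_v(xz, yz, xw, yw, x0, y0):
    """Biased (V-statistic) squared MMD on the product space with kernel (4.4)."""
    def gram(xa, ya, xb, yb):
        # Product kernel: k_X centred at x0 times k_Y centred at y0.
        return distance_gram_1d(xa, xb, x0) * distance_gram_1d(ya, yb, y0)
    return (gram(xz, yz, xz, yz).mean() + gram(xw, yw, xw, yw).mean()
            - 2.0 * gram(xz, yz, xw, yw).mean())

rng = np.random.default_rng(2)
n = 200
x0 = 1.5
y_p = rng.normal(size=n)            # sample from P_Y
y_q = rng.normal(loc=3.0, size=n)   # sample from a clearly different Q_Y

# X is a point mass at the centre x0: the product kernel vanishes identically.
x_fixed = np.full(n, x0)
mmd_degenerate = product_mmd_v(x_fixed, y_p, x_fixed, y_q, x0, 0.0)

# X spread around x0: the same pair (P_Y, Q_Y) is now distinguished.
x_free = rng.normal(size=n)
mmd_generic = product_mmd_v(x_free, y_p, x_free, y_q, x0, 0.0)
```

Here `mmd_degenerate` is exactly zero although the two Y-samples differ
markedly, whereas `mmd_generic` is strictly positive.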

6. Empirical estimates and hypothesis tests. 

6.1. Two-sample testing. So far, we have seen that the population expression
of the MMD between P and Q is well defined as long as P and Q lie in the
space $\mathcal{M}^{1/2}_k(\mathcal{Z})$, or, equivalently, have a finite
half-moment w.r.t. the semimetric $\rho$ generated by k. However, this
assumption will not suffice to establish a meaningful hypothesis test using
empirical estimates of the MMD. We will require a stronger condition, that
$P, Q \in \mathcal{M}^1_+(\mathcal{Z}) \cap \mathcal{M}^1_k(\mathcal{Z})$
(which is the same condition under which the energy distance is well
defined). Note that, under this condition we also have
$k \in L^2_{P \times P}(\mathcal{Z} \times \mathcal{Z})$, as
$\int\int k^2(z,z') \, dP(z) \, dP(z') \le \left( \int k(z,z) \, dP(z) \right)^2$.

Given i.i.d. samples $\mathbf{z} = \{z_i\}_{i=1}^m \sim P$ and
$\mathbf{w} = \{w_i\}_{i=1}^n \sim Q$, the empirical (biased) V-statistic
estimate of (2.3) is given by:

(6.1) $\hat{\gamma}^2_{k,V}(\mathbf{z},\mathbf{w}) = \dfrac{1}{m^2} \sum_{i=1}^m \sum_{j=1}^m k(z_i,z_j) + \dfrac{1}{n^2} \sum_{i=1}^n \sum_{j=1}^n k(w_i,w_j) - \dfrac{2}{mn} \sum_{i=1}^m \sum_{j=1}^n k(z_i,w_j)$.

Recall that if k generates $\rho$, this estimate involves only the pairwise
$\rho$-distances between the sample points.
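
As an illustration of (6.1), the sketch below computes the V-statistic with a
distance kernel and compares it with the corresponding energy statistic built
from pairwise distances only. It is a minimal example under stated assumptions:
the semimetric is $\rho_q(z,z') = \|z - z'\|^q$, the distance kernel is taken
as $k(z,z') = \frac{1}{2}[\rho(z,z_0) + \rho(z',z_0) - \rho(z,z')]$ for a
centre $z_0$, and the function names are hypothetical.

```python
import numpy as np

def pairwise_dist(a, b, q=1.0):
    """Semimetric rho_q(a_i, b_j) = ||a_i - b_j||^q between rows of a and b."""
    diff = a[:, None, :] - b[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1)) ** q

def distance_kernel(a, b, z0, q=1.0):
    """k(a_i, b_j) = 0.5 * [rho(a_i, z0) + rho(b_j, z0) - rho(a_i, b_j)]."""
    z0 = z0[None, :]
    return 0.5 * (pairwise_dist(a, z0, q) + pairwise_dist(b, z0, q).T
                  - pairwise_dist(a, b, q))

def mmd_v(z, w, z0, q=1.0):
    """Biased V-statistic estimate (6.1) of the squared MMD."""
    return (distance_kernel(z, z, z0, q).mean()
            + distance_kernel(w, w, z0, q).mean()
            - 2.0 * distance_kernel(z, w, z0, q).mean())

def energy_v(z, w, q=1.0):
    """V-statistic of the energy distance, using pairwise distances only."""
    return (2.0 * pairwise_dist(z, w, q).mean()
            - pairwise_dist(z, z, q).mean()
            - pairwise_dist(w, w, q).mean())

rng = np.random.default_rng(0)
z = rng.normal(size=(40, 3))            # m = 40 points from P
w = rng.normal(loc=0.5, size=(60, 3))   # n = 60 points from Q
mmd = mmd_v(z, w, z0=np.zeros(3))
energy = energy_v(z, w)
```

With this normalization of the kernel, `energy` equals `2 * mmd` up to
floating-point error, and the value of `mmd` does not depend on the choice of
the centre `z0`, since the terms involving `z0` cancel in (6.1).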

We now describe a two-sample test using this statistic. The kernel $k_P$
centred at P in (2.12) plays a key role in characterizing the null distribution
of the degenerate V-statistic. To $k_P$, we associate the integral kernel
operator $S_{k_P}: L^2_P(\mathcal{Z}) \to L^2_P(\mathcal{Z})$ (cf. e.g.,
[Steinwart and Christmann, 2008, p. 126-127]), given by:

(6.2) $S_{k_P} g(z) = \int_{\mathcal{Z}} k_P(z,w) \, g(w) \, dP(w)$.

The condition that $P \in \mathcal{M}^1_k(\mathcal{Z})$, and, as a consequence,
that $k_P \in L^2_{P \times P}(\mathcal{Z} \times \mathcal{Z})$, is closely
related to the desired properties of the integral operator. Namely, this
implies that $S_{k_P}$ is a trace class operator, and, thus, a Hilbert-Schmidt
operator [Reed and Simon, 1980, Proposition VI.23]. The following theorem is a
special case of [Gretton et al., 2012, Theorem 12], where for simplicity, we
focus on the case where m = n.

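Theorem 30. Let k be a kernel on $\mathcal{Z}$, and let
$\mathbf{Z} = \{Z_i\}_{i=1}^m$ and $\mathbf{W} = \{W_i\}_{i=1}^m$ be two i.i.d.
samples from $P \in \mathcal{M}^1_+(\mathcal{Z}) \cap \mathcal{M}^1_k(\mathcal{Z})$. Then

(6.3) $\dfrac{m}{2} \hat{\gamma}^2_{k,V}(\mathbf{Z},\mathbf{W}) \rightsquigarrow \sum_{i=1}^{\infty} \lambda_i N_i^2$,

where $N_i \overset{i.i.d.}{\sim} \mathcal{N}(0,1)$, $i \in \mathbb{N}$, and
$\{\lambda_i\}_{i=1}^{\infty}$ are the eigenvalues of the operator $S_{k_P}$.

Note that the limiting expression in (6.3) is a valid random variable
precisely since $S_{k_P}$ is Hilbert-Schmidt, i.e., since
$\sum_{i=1}^{\infty} \lambda_i^2 < \infty$.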

6.2. Independence testing. In the case of independence testing, we are
given i.i.d. samples $\mathbf{z} = \{(x_i, y_i)\}_{i=1}^m \sim P_{XY}$, and the
resulting V-statistic estimate (HSIC) is [Gretton et al., 2005, 2008]:

(6.4) $HSIC(\mathbf{z}; k_\mathcal{X}, k_\mathcal{Y}) = \dfrac{1}{m^2} \mathrm{Tr}(K_\mathcal{X} H K_\mathcal{Y} H)$,

where $K_\mathcal{X}$, $K_\mathcal{Y}$ and H are m × m matrices given by
$(K_\mathcal{X})_{ij} := k_\mathcal{X}(x_i,x_j)$,
$(K_\mathcal{Y})_{ij} := k_\mathcal{Y}(y_i,y_j)$ and
$H_{ij} = \delta_{ij} - \frac{1}{m}$ (centering matrix). The null distribution
of HSIC takes an analogous form to (6.3) of a weighted sum of chi-squares, but
with coefficients corresponding to the products of the eigenvalues of integral
operators $S_{k_{P_X}}: L^2_{P_X}(\mathcal{X}) \to L^2_{P_X}(\mathcal{X})$ and
$S_{k_{P_Y}}: L^2_{P_Y}(\mathcal{Y}) \to L^2_{P_Y}(\mathcal{Y})$. Similarly to
the case of two-sample testing, we will require that
$P_X \in \mathcal{M}^1_{k_\mathcal{X}}(\mathcal{X})$ and
$P_Y \in \mathcal{M}^1_{k_\mathcal{Y}}(\mathcal{Y})$, implying that the integral
operators $S_{k_{P_X}}$ and $S_{k_{P_Y}}$ are trace class operators. The
following theorem is from [Zhang et al., 2011, Theorem 4]. See also
[Lyons, 2011, Remark 2.9].

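Theorem 31. Let $\mathbf{Z} = \{(X_i,Y_i)\}_{i=1}^m$ be an i.i.d. sample from
$P_{XY} = P_XP_Y$, with values in $\mathcal{X} \times \mathcal{Y}$, s.t.
$P_X \in \mathcal{M}^1_{k_\mathcal{X}}(\mathcal{X})$ and
$P_Y \in \mathcal{M}^1_{k_\mathcal{Y}}(\mathcal{Y})$. Then

(6.5) $m \, HSIC(\mathbf{Z}; k_\mathcal{X}, k_\mathcal{Y}) \rightsquigarrow \sum_{i=1}^{\infty} \sum_{j=1}^{\infty} \lambda_i \eta_j N_{ij}^2$,

where $N_{ij} \sim \mathcal{N}(0,1)$, $i,j \in \mathbb{N}$, are independent and
$\{\lambda_i\}_{i=1}^{\infty}$ and $\{\eta_j\}_{j=1}^{\infty}$ are the
eigenvalues of the operators $S_{k_{P_X}}$ and $S_{k_{P_Y}}$, respectively.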
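
The equivalence in Theorem 23 can be checked at the level of the empirical
estimates. The sketch below is illustrative only: it assumes Euclidean
distances, distance kernels of the form
$k(x,x') = \frac{1}{2}[\rho(x,x_0) + \rho(x',x_0) - \rho(x,x')]$ with
arbitrary centres, and the sample distance covariance computed from
double-centred distance matrices as in Szekely et al. [2007]; the function
names are hypothetical.

```python
import numpy as np

def dist_matrix(x):
    """Euclidean distance matrix between the rows of x."""
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def distance_gram(x, x0):
    """Gram matrix of the distance kernel centred at x0."""
    r = np.sqrt(((x - x0) ** 2).sum(axis=1))   # rho(x_i, x0)
    return 0.5 * (r[:, None] + r[None, :] - dist_matrix(x))

def hsic_v(kx, ky):
    """V-statistic (6.4): Tr(K_X H K_Y H) / m^2."""
    m = kx.shape[0]
    h = np.eye(m) - np.ones((m, m)) / m
    return np.trace(kx @ h @ ky @ h) / m ** 2

def dcov2_v(x, y):
    """Sample distance covariance from double-centred distance matrices."""
    m = x.shape[0]
    h = np.eye(m) - np.ones((m, m)) / m
    a = h @ dist_matrix(x) @ h
    b = h @ dist_matrix(y) @ h
    return (a * b).sum() / m ** 2

rng = np.random.default_rng(1)
x = rng.normal(size=(50, 2))
y = x ** 2 + 0.1 * rng.normal(size=(50, 2))   # dependent on x, but not linearly

# Centres are arbitrary: the first sample point for X, the origin for Y.
hsic = hsic_v(distance_gram(x, x[0]), distance_gram(y, np.zeros(2)))
dcov2 = dcov2_v(x, y)
```

Up to floating-point error, `dcov2` equals `4 * hsic`, whatever centres are
chosen, because the centering matrix H removes every term that depends on a
single argument only.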

6.3. Test designs. We would like to design distance-based tests with an
asymptotic Type I error of $\alpha$, and thus we require an estimate of the
$(1 - \alpha)$-quantile of the null distribution. We investigate two
approaches, both of which yield consistent tests: a bootstrap approach
[Arcones and Gine, 1992], and a spectral approach [Gretton et al., 2009a,
Zhang et al., 2011]. The latter requires empirical computation of eigenvalues
of kernel integral operators, a problem studied extensively in the context of
kernel PCA [Scholkopf et al., 1997]. To estimate the limiting distribution in
(6.3), we compute the spectrum of the centred Gram matrix $\tilde{K} = HKH$ on
the aggregated samples. Here, K is a 2m × 2m matrix with entries
$K_{ij} = k(u_i,u_j)$, where $\mathbf{u} = [\mathbf{z} \; \mathbf{w}]$ is the
concatenation of the two samples and H is the centering matrix. Gretton et al.
[2009a] show that the null distribution defined using the finite sample
estimates of these eigenvalues converges to the population distribution,
provided that the spectrum is square-root summable. As demonstrated in
[Gretton et al., 2009a], spectral estimation of the test threshold has a
smaller computational cost than that of the bootstrap-based approach, while
providing an indistinguishable performance. The same approach can be used in
obtaining a consistent finite sample null distribution for HSIC, via
computation of the empirical eigenvalues of
$\tilde{K}_\mathcal{X} = HK_\mathcal{X}H$ and
$\tilde{K}_\mathcal{Y} = HK_\mathcal{Y}H$; see [Zhang et al., 2011].
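
A spectral two-sample test along these lines might be sketched as follows. This
is a hedged illustration, not the authors' implementation: it assumes equal
sample sizes, a Euclidean distance kernel centred at the origin, and that the
eigenvalues of $\tilde{K}$ are divided by 2m so that they match the
$(m/2)$-scaled statistic in (6.3); the function names and the number of Monte
Carlo draws are arbitrary choices.

```python
import numpy as np

def distance_gram(u, q=1.0):
    """Gram matrix of the distance kernel centred at the origin (arbitrary centre)."""
    diff = u[:, None, :] - u[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1)) ** q
    r = np.sqrt((u ** 2).sum(axis=1)) ** q
    return 0.5 * (r[:, None] + r[None, :] - d)

def spectral_test(z, w, alpha=0.05, n_draws=2000, seed=0):
    """Return (statistic, threshold) for the scaled statistic (m / 2) * MMD_V^2."""
    m = z.shape[0]
    assert w.shape[0] == m, "equal sample sizes assumed, as in Theorem 30"
    u = np.vstack([z, w])                      # aggregated sample, size 2m
    k = distance_gram(u)
    stat = 0.5 * m * (k[:m, :m].mean() + k[m:, m:].mean()
                      - 2.0 * k[:m, m:].mean())
    # Empirical eigenvalues of the centred Gram matrix, scaled by 1 / (2m).
    h = np.eye(2 * m) - np.ones((2 * m, 2 * m)) / (2 * m)
    eig = np.linalg.eigvalsh(h @ k @ h) / (2 * m)
    eig = eig[eig > 1e-12]                     # drop numerically zero eigenvalues
    # Monte Carlo draws from sum_i lambda_i * N_i^2 with N_i ~ N(0, 1).
    rng = np.random.default_rng(seed)
    null = (rng.standard_normal((n_draws, eig.size)) ** 2) @ eig
    return stat, np.quantile(null, 1.0 - alpha)

rng = np.random.default_rng(3)
z = rng.normal(size=(100, 1))
w = rng.normal(loc=3.0, size=(100, 1))         # strongly shifted alternative
stat, threshold = spectral_test(z, w)
reject = stat > threshold
```

The null hypothesis is rejected when the statistic exceeds the estimated
$(1 - \alpha)$-quantile; for the strongly shifted alternative used here, the
statistic lies far above the threshold.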

Both Szekely and Rizzo [2004, p. 14] and Szekely et al. [2007, p. 2782- 
2783] establish that the energy distance and distance covariance statistics, 
respectively, converge to the weighted sums of chi-squares of forms similar 
to (6.3). Analogous results for the generalized distance covariance are
presented in [Lyons, 2011, p. 7-8]. These works do not propose test designs
that attempt to estimate the coefficients $\lambda_i$, $i \in \mathbb{N}$,
however. Besides the bootstrap, Szekely et al. [2007, Theorem 6] also propose
an independence test using a bound applicable to a general quadratic form Q of
centered Gaussian random variables with E[Q] = 1:
$P\left\{ Q \ge \left( \Phi^{-1}(1 - \alpha/2) \right)^2 \right\} \le \alpha$,
valid for $0 < \alpha \le 0.215$. When applied to the distance covariance
statistic, the upper bound of $\alpha$ is achieved if X and Y are independent
Bernoulli variables.
The authors remark that the resulting criterion might be over-conservative. 
Thus, more sensitive distance covariance tests are possible by computing the 
spectrum of the centred Gram matrices associated to distance kernels, and 
we pursue this approach in the next section. 
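
For reference, the threshold implied by the quadratic form bound is simple to
evaluate. The snippet below is a minimal sketch using only the Python standard
library; the function name is hypothetical, and the normalization of the test
statistic to which the threshold is applied is left to the cited work.

```python
from statistics import NormalDist

def quadratic_form_threshold(alpha):
    """Threshold (Phi^{-1}(1 - alpha / 2))^2 from the quadratic form bound."""
    if not 0.0 < alpha <= 0.215:
        raise ValueError("the bound is stated only for 0 < alpha <= 0.215")
    return NormalDist().inv_cdf(1.0 - alpha / 2.0) ** 2

threshold = quadratic_form_threshold(0.05)
```

For $\alpha = 0.05$ the threshold is approximately 3.84, which coincides with
the 0.95-quantile of a chi-squared distribution with one degree of freedom.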

7. Experiments. 

7.1. Two-sample experiments. In the two-sample experiments, we inves- 
tigate three different kinds of synthetic data. In the first, we compare two 
multivariate Gaussians, where the means differ in one dimension only, and 
all variances are equal. In the second, we again compare two multivariate 
Gaussians, but this time with identical means in all dimensions, and vari- 
ance that differs in a single dimension. In our third experiment, we use the 
benchmark data of Sriperumbudur et al. [2009]: one distribution is a uni- 
variate Gaussian, and the second is a univariate Gaussian with a sinusoidal 
perturbation of increasing frequency (where higher frequencies correspond 
to harder problems). All tests use a distance kernel induced by the Euclidean 
distance. As shown on the left hand plots in Figure 3, the spectral and boot- 
strap test designs appear indistinguishable, and significantly outperform the 
test designed using the quadratic form bound, which appears to be far too 
conservative for the data sets considered. This is confirmed by checking the 
Type I error of the quadratic form test, which is significantly smaller than 
the desired test size of $\alpha = 0.05$.

Figure 3. (left) MMD using Gaussian and distance kernels for various tests; (right)
Spectral MMD using distance kernels with various exponents.



We also compare the performance to that of the Gaussian kernel, with 
the bandwidth set to the median distance between points in the aggregation 
of samples. We see that when the means differ, both tests perform similarly. 
When the variances differ, it is clear that the Gaussian kernel has a major
advantage over the distance kernel, although this advantage decreases with 
increasing dimension (where both perform poorly). In the case of a sinusoidal 
perturbation, the performance is again very similar. 
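
The bandwidth choice for the Gaussian kernel can be sketched as follows. This
is an illustrative example only: the parameterization
$k(z,z') = \exp(-\|z - z'\|^2 / (2\sigma^2))$ and the use of distinct pairs
when taking the median are assumptions, and the function names are
hypothetical.

```python
import numpy as np

def median_bandwidth(z, w):
    """Median pairwise Euclidean distance in the aggregated sample [z; w]."""
    u = np.vstack([z, w])
    diff = u[:, None, :] - u[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    upper = np.triu_indices(u.shape[0], k=1)   # each distinct pair once
    return np.median(d[upper])

def gaussian_gram(a, b, sigma):
    """Gaussian kernel exp(-||a_i - b_j||^2 / (2 * sigma^2))."""
    diff = a[:, None, :] - b[None, :, :]
    return np.exp(-(diff ** 2).sum(axis=-1) / (2.0 * sigma ** 2))

z = np.array([[0.0], [1.0]])
w = np.array([[3.0]])
sigma = median_bandwidth(z, w)     # pairwise distances are 1, 3 and 2
k_zz = gaussian_gram(z, z, sigma)
```

In this toy example the pairwise distances are 1, 2 and 3, so the bandwidth is
their median, 2.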

In addition, following Example 11, we investigate performance of kernels 
obtained using the semimetric $\rho(z,z') = \|z - z'\|^q$ for $0 < q \le 2$. Results
are presented in the right hand plots of Figure 3. While we note that the 
judiciously chosen values of q offer some improvement in the cases of differing 
mean and variance, a dramatic improvement compared to the case q = 1 
and the Gaussian kernel is noticeable in the case of sinusoidal perturbation, 
where the values q = 1/3 (and smaller) offer virtually error- free performance 
even at high frequencies (note that q = 1 yields the energy distance described 
in [Szekely and Rizzo, 2004, 2005]). 
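
The family of kernels indexed by the exponent q can be written down directly.
The sketch below assumes the distance kernel
$k_q(z,z') = \frac{1}{2}(\|z\|^q + \|z'\|^q - \|z - z'\|^q)$, i.e., centred at
the origin, which is one possible choice of centre; the function name is
hypothetical. It checks numerically that the resulting Gram matrices are
positive semidefinite for several exponents.

```python
import numpy as np

def distance_gram_q(z, q):
    """k_q(z, z') = 0.5 * (||z||^q + ||z'||^q - ||z - z'||^q), centred at the origin."""
    diff = z[:, None, :] - z[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1)) ** q
    r = np.sqrt((z ** 2).sum(axis=1)) ** q
    return 0.5 * (r[:, None] + r[None, :] - d)

rng = np.random.default_rng(4)
z = rng.normal(size=(30, 2))

# Smallest eigenvalue of the Gram matrix for a few exponents in (0, 2].
min_eigs = {q: np.linalg.eigvalsh(distance_gram_q(z, q)).min()
            for q in (1.0 / 3.0, 1.0, 2.0)}

# For q = 2 the distance kernel reduces to the linear kernel <z, z'>.
linear_gram = z @ z.T
```

The case q = 2 recovers the linear kernel, which explains why it can only
detect differences in means, while q = 1 corresponds to the standard energy
distance.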

We observe from the simulation results that distance kernels with higher 
exponents are advantageous in cases where distributions differ in mean value 
along a single dimension (with noise in the remainder), whereas distance ker- 
nels with smaller exponents are more sensitive to differences in distributions 
at finer lengthscales (i.e., where the characteristic functions of the distribu- 
tions differ at higher frequencies). 

7.2. Independence experiments. To assess independence tests, we used an 
artificial benchmark proposed by Gretton et al. [2008]: we generated univari- 
ate random variables from the ICA benchmark densities of Bach and Jordan 
[2002]; rotated them in the product space by an angle between 0 and $\pi/4$ to
introduce dependence; filled additional dimensions with independent Gaus- 
sian noise; and, finally, passed the resulting multivariate data through ran- 
dom and independent orthogonal transformations. The resulting random 
variables X and Y were dependent but uncorrelated. The case m = 128 
(sample size) and d = 2 (dimension) is plotted in Figure 4 (left). As ob- 
served by Gretton et al. [2009b], the Gaussian kernel does better than the 
distance kernel with q = 1. By varying q, however, we are able to obtain a 
wide performance range: in particular, the values q = 1/3 (and smaller) have 
an advantage over the Gaussian kernel on this dataset. As for the two-sample 
case, bootstrap and spectral tests have indistinguishable performance, and 
are significantly more sensitive than the quadratic form-based test, which 
failed to detect any dependence on this dataset. 

In addition, we assess the performance on sinusoidally dependent data. 
The sample of the random variable pair X, Y was drawn from $P_{XY} \propto 1 +
\sin(\ell x) \sin(\ell y)$ for integer $\ell$, on the support $\mathcal{X} \times \mathcal{Y}$, where $\mathcal{X} := [-\pi, \pi]$ and
$\mathcal{Y} := [-\pi, \pi]$. In this way, increasing $\ell$ causes the departure from a uniform
(independent) distribution to occur at increasing frequencies, making this
departure harder to detect given a small sample size. Results are in Figure 
4 (right). The distance covariance outperforms the Gaussian kernel on this 
example, and smaller exponents result in better performance (lower Type II 
error when the departure from independence occurs at higher frequencies). 
Finally, we note that the setting q = 1, as described by Szekely et al. [2007],
Szekely and Rizzo [2009], is a reasonable heuristic in practice, but does not 
yield the most powerful tests on either dataset. 
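
One way to draw samples from the sinusoidally dependent density is rejection
sampling from the uniform distribution on the square. The following sketch is
an assumption about how such data could be generated, not a description of the
procedure actually used; the function name is hypothetical.

```python
import numpy as np

def sample_sinusoid(m, ell, seed=0):
    """Draw m pairs (x, y) with density proportional to 1 + sin(ell x) sin(ell y)
    on [-pi, pi]^2, by rejection sampling from the uniform distribution."""
    rng = np.random.default_rng(seed)
    out = np.empty((0, 2))
    while out.shape[0] < m:
        cand = rng.uniform(-np.pi, np.pi, size=(2 * m, 2))
        # The unnormalized density is bounded by 2, so accept with probability density / 2.
        accept_prob = 0.5 * (1.0 + np.sin(ell * cand[:, 0]) * np.sin(ell * cand[:, 1]))
        keep = rng.uniform(size=2 * m) < accept_prob
        out = np.vstack([out, cand[keep]])
    return out[:m, 0], out[:m, 1]

x, y = sample_sinusoid(512, ell=2)
# Under this density E[sin(ell X) sin(ell Y)] = 1/4, whereas it is 0 under independence.
dependence_score = np.mean(np.sin(2 * x) * np.sin(2 * y))
```

Both marginals are uniform on $[-\pi, \pi]$ for every integer $\ell$, so the
dependence is visible only in the joint distribution, at a lengthscale that
shrinks as $\ell$ grows.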



Figure 4. HSIC using distance kernels with various exponents and a Gaussian kernel as
a function of (left) the angle of rotation (in multiples of $\pi/4$) for the dependence
induced by rotation, with m = 128, d = 2, $\alpha = 0.05$; (right) frequency $\ell$ in
the sinusoidal dependence example, with m = 512, $\alpha = 0.05$.



8. Conclusion. We have established an equivalence between the gen- 
eralized notions of energy distance and distance covariance, computed with 
respect to semimetrics of negative type, and RKHS measures of distance 
between distributions. In particular, energy distances and RKHS distance 
measures coincide when the semimetrics and kernels are related in a partic- 
ular way. 

The interpretation of the energy distance and distance covariance in an 
RKHS setting should be of considerable interest both to statisticians and 
machine learning researchers, since the associated kernels may be used much 
more widely: in conditional dependence testing and estimates of the chi- 
squared distance [Fukumizu et al., 2008], in Bayesian inference [Fukumizu et al., 
2011], in mixture density estimation [Sriperumbudur, 2011] and in other ma- 
chine learning applications. In particular, the link with kernels makes these 
applications of the energy distance immediate and straightforward. Finally, 
for problem settings defined most naturally in terms of distances, and where 
these distances are of negative type, there is an interpretation in terms of
reproducing kernels, and the learning machinery from the kernel literature 
can be brought to bear. 

APPENDIX A: DISTANCE CORRELATION

As described by Szekely et al. [2007], the notion of distance covariance
extends naturally to that of distance variance
$\mathcal{V}^2(X) = \mathcal{V}^2(X,X)$ and of distance correlation (by analogy
with the Pearson product-moment correlation coefficient),

$\mathcal{R}^2(X,Y) = \begin{cases} \dfrac{\mathcal{V}^2(X,Y)}{\mathcal{V}(X)\mathcal{V}(Y)}, & \mathcal{V}(X)\mathcal{V}(Y) > 0, \\ 0, & \mathcal{V}(X)\mathcal{V}(Y) = 0. \end{cases}$

Distance correlation also has a straightforward interpretation in terms of
kernels,

$\mathcal{R}^2(X,Y) = \dfrac{\mathcal{V}^2(X,Y)}{\mathcal{V}(X)\mathcal{V}(Y)} = \dfrac{\gamma_k^2(P_{XY}, P_XP_Y)}{\gamma_k(P_{XX}, P_XP_X)\, \gamma_k(P_{YY}, P_YP_Y)} = \dfrac{\|\Sigma_{XY}\|^2_{HS}}{\|\Sigma_{XX}\|_{HS} \, \|\Sigma_{YY}\|_{HS}}$,

where the covariance operator
$\Sigma_{XY}: \mathcal{H}_{k_\mathcal{X}} \to \mathcal{H}_{k_\mathcal{Y}}$ is a
linear operator for which
$\langle \Sigma_{XY} f, g \rangle_{\mathcal{H}_{k_\mathcal{Y}}} = E_{XY}[f(X)g(Y)] - E_X f(X) \, E_Y g(Y)$
for all $f \in \mathcal{H}_{k_\mathcal{X}}$ and
$g \in \mathcal{H}_{k_\mathcal{Y}}$, and $\|\cdot\|_{HS}$ denotes the
Hilbert-Schmidt norm [Gretton et al., 2005]. It is clear that $\mathcal{R}$ is
invariant to scaling $(X,Y) \mapsto (\epsilon X, \epsilon Y)$, $\epsilon > 0$,
whenever the corresponding semimetrics are homogeneous, i.e., whenever
$\rho_\mathcal{X}(\epsilon x, \epsilon x') = \epsilon \rho_\mathcal{X}(x,x')$,
and similarly for $\rho_\mathcal{Y}$. Moreover, $\mathcal{R}$ is invariant to
translations, $(X,Y) \mapsto (X + x', Y + y')$, $x' \in \mathcal{X}$,
$y' \in \mathcal{Y}$, whenever $\rho_\mathcal{X}$ and $\rho_\mathcal{Y}$ are
translation invariant.
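
These invariances are easy to verify on the empirical distance correlation. The
sketch below is a minimal example assuming Euclidean distances (which are both
homogeneous and translation invariant) and the sample statistics built from
double-centred distance matrices; the function names are hypothetical.

```python
import numpy as np

def centred_dist(x):
    """Double-centred Euclidean distance matrix H D H."""
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    m = d.shape[0]
    h = np.eye(m) - np.ones((m, m)) / m
    return h @ d @ h

def dcor2(x, y):
    """Empirical squared distance correlation V^2(X,Y) / (V(X) V(Y))."""
    a, b = centred_dist(x), centred_dist(y)
    vxy, vxx, vyy = (a * b).mean(), (a * a).mean(), (b * b).mean()
    denom = np.sqrt(vxx * vyy)      # V(X) V(Y) = sqrt(V^2(X) V^2(Y))
    return vxy / denom if denom > 0 else 0.0

rng = np.random.default_rng(6)
x = rng.normal(size=(40, 2))
y = x ** 2 + rng.normal(size=(40, 2))

r2 = dcor2(x, y)
r2_scaled = dcor2(3.0 * x, 3.0 * y)     # (X, Y) -> (eps X, eps Y) with eps = 3
r2_shifted = dcor2(x + 1.0, y - 2.0)    # translation of both variables
r2_self = dcor2(x, x)                   # equals 1 by construction
```

All three of `r2`, `r2_scaled` and `r2_shifted` agree up to floating-point
error.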

APPENDIX B: LINK WITH UNIVERSAL KERNELS 

We briefly remark on how our results on equivalent kernels relate to 
the notion of universal kernels on compact metric spaces in the sense of 
Steinwart and Christmann [2008, Definition 4.52]: 

Definition 32. A continuous kernel k on a compact metric space $\mathcal{Z}$ is
said to be universal if its RKHS $\mathcal{H}_k$ is dense in the space
$C(\mathcal{Z})$ of continuous functions on $\mathcal{Z}$, endowed with the
uniform norm.

The family of universal kernels includes the most popular choices in ma- 
chine learning literature including the Gaussian and the Laplacian kernel. 
The following characterization of universal kernels is due to Sriperumbudur et al. 
[2011]: 

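Proposition 33. Let k be a continuous kernel on a compact metric
space $\mathcal{Z}$. Then, k is universal if and only if
$\mu_k: \mathcal{M}(\mathcal{Z}) \to \mathcal{H}_k$ is a vector space
monomorphism, i.e.,

$\|\mu_k(\nu)\|^2_{\mathcal{H}_k} = \int\int k(z,z') \, d\nu(z) \, d\nu(z') > 0 \quad \forall \nu \in \mathcal{M}(\mathcal{Z}) \setminus \{0\}$.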

As a direct consequence, every universal kernel k is also characteristic,
as $\mu_k$ is, in particular, injective on the space of probability measures.
Now, consider a kernel $k_f$ centered at $f = \mu_k(\nu)$ for some
$\nu \in \mathcal{M}(\mathcal{Z})$, such that $\nu(\mathcal{Z}) = 1$. Then
$k_f$ is no longer universal, since

$\|\mu_{k_f}(\nu)\|^2_{\mathcal{H}_{k_f}} = \int\int k_f(z,z') \, d\nu(z) \, d\nu(z') = \|\mu_k(\nu) - \nu(\mathcal{Z}) f\|^2_{\mathcal{H}_k} = 0$.

However, $k_f$ is still characteristic, as it is equivalent to k. This means
that all kernels of the form (2.12), including the distance kernels, are
examples of non-universal characteristic kernels, provided that they generate
a semimetric $\rho$ of strong negative type. In particular, the kernel in
(2.10) on a compact $\mathcal{Z} \subset \mathbb{R}^d$ is a characteristic
non-universal kernel for q < 2. This result is of some interest to the machine
learning community, as such kernels have typically been difficult to
construct. For example, the two notions are known to be equivalent on the
family of translation invariant kernels on $\mathbb{R}^d$
[Sriperumbudur et al., 2011].
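
The loss of universality under centering can be seen on a finite sample. The
sketch below is illustrative only: it takes $\nu$ to be the empirical measure
of a sample, uses a Gaussian kernel with unit bandwidth as the universal
kernel, and forms the Gram matrix of the kernel centred at $f = \mu_k(\nu)$ as
HKH; these choices and the variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(5)
z = rng.uniform(size=(25, 2))                      # points in a compact set [0, 1]^2
sq = ((z[:, None, :] - z[None, :, :]) ** 2).sum(axis=-1)
k = np.exp(-sq / 2.0)                              # Gaussian kernel Gram matrix

m = z.shape[0]
nu = np.full(m, 1.0 / m)                           # empirical measure, nu(Z) = 1
h = np.eye(m) - np.ones((m, m)) / m
k_f = h @ k @ h                                    # Gram matrix of the kernel centred at mu_k(nu)

norm_k = nu @ k @ nu                               # ||mu_k(nu)||^2, strictly positive
norm_kf = nu @ k_f @ nu                            # ||mu_{k_f}(nu)||^2, vanishes
```

The non-zero measure $\nu$ is embedded to a non-zero element by k but to zero
by the centred kernel, so the embedding of finite signed measures under $k_f$
is not injective, although $k_f$ still separates probability measures.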